# **Case Study: How Does a Bike-Share Navigate Speedy Success?**
## Background:
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system at any time.

Financial analysts at the company concluded that annual members are much more profitable than casual riders and advised top management to invest in a marketing strategy that converts casual riders into annual members. As a result, the marketing analysts need to better understand rider needs and trends from Cyclistic's historical bike trip data.

We will go through four phases of data analysis (Ask, Prepare, Process, and Analyze) to advise whether investing in the marketing strategy would be a good decision.
### 1. Ask 
#### What problem is being solved?
The business task at hand is to find out whether this marketing strategy is worth investing in. To get started, an analysis of the difference in usage of Cyclistic bikes between annual members and casual riders needs to be done.
#### Consider key stakeholders
The key stakeholders are:
* Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
* Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
* Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
* Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.



### 2. Prepare
We will use Cyclistic’s historical trip data to analyze and identify trends. This is public data that we can use to explore how different customer types are using Cyclistic bikes.
The first step is to import the datasets into R:

In [1]:
#Load all necessary packages
library('tidyverse')
library('lubridate')
library('ggplot2')
library('dplyr')
library('tidyr')
library('readr')
library('tibble')
library('stringr')
library('forcats')
library('readxl')

#Import data
nov2020 <- read_csv("Desktop/data/202011-divvy-tripdata.csv")
dec2020 <- read_csv("Desktop/data/202012-divvy-tripdata.csv")
jan2021 <- read_csv("Desktop/data/202101-divvy-tripdata.csv")
feb2021 <- read_csv("Desktop/data/202102-divvy-tripdata.csv")
mar2021 <- read_csv("Desktop/data/202103-divvy-tripdata.csv")
apr2021 <- read_csv("Desktop/data/202104-divvy-tripdata.csv")
may2021 <- read_csv("Desktop/data/202105-divvy-tripdata.csv")
june2021 <- read_csv("Desktop/data/202106-divvy-tripdata.csv")
july2021 <- read_csv("Desktop/data/202107-divvy-tripdata.csv")
aug2021 <- read_csv("Desktop/data/202108-divvy-tripdata.csv")
sep2021 <- read_csv("Desktop/data/202109-divvy-tripdata.csv")
oct2021 <- read_csv("Desktop/data/202110-divvy-tripdata.csv")


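The twelve near-identical `read_csv()` calls above could also be written as a single pass over the folder. The following is only a sketch: it writes two tiny stand-in CSVs to a temporary directory so that it runs anywhere, and the names `data_dir`, `files`, and `monthly` are illustrative rather than part of the original analysis. In practice `data_dir` would point at the folder holding the real monthly files.

```r
library(readr)

# Toy stand-in for the data folder: two tiny CSVs in a temporary directory
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write_csv(data.frame(ride_id = "A1", member_casual = "member"),
          file.path(data_dir, "202011-divvy-tripdata.csv"))
write_csv(data.frame(ride_id = "B2", member_casual = "casual"),
          file.path(data_dir, "202012-divvy-tripdata.csv"))

# Find every file that follows the monthly naming pattern
files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)

# Read each file into one element of a list, named by its year-month prefix
monthly <- lapply(files, read_csv)
names(monthly) <- sub("-divvy-tripdata\\.csv$", "", basename(files))
names(monthly)
```

Keeping the months in a named list avoids one variable per file, at the cost of looking months up by name (for example `monthly[["202011"]]`).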

### 3. Process
All the datasets had the same columns: ride_id, rideable_type, started_at, ended_at, start_station_name, start_station_id, end_station_name, end_station_id, start_lat, start_lng, end_lat, end_lng, and member_casual.

However, start_station_id and end_station_id were read as character strings in the 2021 datasets but as numeric (double) values in the 2020 datasets. To combine all the datasets, these fields need the same data type, so we will convert start_station_id and end_station_id in the 2020 datasets from double to character.

In [None]:
#Convert start_station_id and end_station_id in the 2020 datasets from double to character
nov2020 <- mutate(nov2020, start_station_id=as.character(start_station_id), end_station_id=as.character(end_station_id))
dec2020 <- mutate(dec2020, start_station_id=as.character(start_station_id), end_station_id=as.character(end_station_id))

#Combine all datasets together to dataset called all_trips
all_trips <- rbind(nov2020, dec2020, jan2021, feb2021, mar2021, apr2021, may2021, june2021, july2021, aug2021, sep2021, oct2021)
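A type mismatch like the one described above can be spotted by comparing column classes before combining. The sketch below uses two toy tables (`trips_2020` and `trips_2021`, with made-up station IDs) in place of the real monthly files. It uses `bind_rows()` from dplyr instead of `rbind()` because `bind_rows()` raises an error when a column is double in one table and character in another, which makes the problem visible.

```r
library(dplyr)

# Toy stand-ins: numeric station IDs (as in 2020) and character IDs (as in 2021)
trips_2020 <- tibble(ride_id = "A1", start_station_id = 110)
trips_2021 <- tibble(ride_id = "B2", start_station_id = "TA1305000030")

# Show the class of each column, one table per output column
sapply(list(y2020 = trips_2020, y2021 = trips_2021),
       function(df) sapply(df, function(col) class(col)[1]))

# Align the types first, then stack the rows
trips_2020 <- mutate(trips_2020, start_station_id = as.character(start_station_id))
combined <- bind_rows(trips_2020, trips_2021)
combined
```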


As part of preparing the data, we can drop fields we don't need, such as start_lat, end_lat, start_lng, and end_lng. We will recreate the dataset without these fields.



In [None]:
#remove the latitude and longitude fields
all_trips <- all_trips %>%  
  select(-c(start_lat, start_lng, end_lat, end_lng))

#inspect the dataset to see if the fields have been removed
colnames(all_trips)


### 4. Analyze  
Now that we have a single dataset, we can improve our analysis by adding a calculated field for trip duration (ride_length) and by splitting the start date into day, month, and year. We will first add columns for the date, the day of the week, the day, the month, and the year, and then add the calculated ride_length field. Later in the analysis, we will eliminate trip durations that are zero or negative.

In [None]:
#add columns for date, day of the week, day, month and year
all_trips$date <- as.Date(all_trips$started_at) 
all_trips$day_of_week <- format(as.Date(all_trips$date), "%A")
all_trips$day <- format(as.Date(all_trips$date), "%d")
all_trips$month <- format(as.Date(all_trips$date), "%m")
all_trips$year <- format(as.Date(all_trips$date), "%Y")

#add calculated field for ride_length
all_trips <- all_trips %>%
  mutate(ride_length = difftime(ended_at, started_at, units = "secs"))  #duration in seconds

# Convert "ride_length" to numeric so that we can perform calculations on it
all_trips$ride_length <- as.numeric(all_trips$ride_length)
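To see how `difftime()` measures a duration, the small base R sketch below uses two made-up end times. When `units` is not supplied, `difftime()` picks a unit from the data, so stating the unit explicitly keeps the resulting numbers unambiguous. The timestamps and the variable names `start`, `end`, `auto`, and `secs` are illustrative only.

```r
# One start time and two end times (45 seconds and 2 hours later)
start <- as.POSIXct("2021-06-01 08:00:00", tz = "UTC")
end   <- as.POSIXct(c("2021-06-01 08:00:45", "2021-06-01 10:00:00"), tz = "UTC")

# Without `units`, difftime() chooses a unit based on the values it sees
auto <- difftime(end, start)
units(auto)

# With an explicit unit, the numeric result is always in seconds
secs <- as.numeric(difftime(end, start, units = "secs"))
secs
```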

We want to concentrate on trip durations that are greater than zero. We will therefore create a new dataset containing only trips with a ride_length greater than 0.

In [None]:
all_tripsv2 <- all_trips[!(all_trips$ride_length <= 0),]

We can get the mean, median, maximum and minimum ride_length using the summary() function, as below:



In [None]:
summary(all_tripsv2$ride_length)

We can further compare the mean, median, maximum and minimum ride_length by type of rider, i.e. member or casual.

In [None]:
aggregate(all_tripsv2$ride_length ~ all_tripsv2$member_casual, FUN = mean)
aggregate(all_tripsv2$ride_length ~ all_tripsv2$member_casual, FUN = median)
aggregate(all_tripsv2$ride_length ~ all_tripsv2$member_casual, FUN = max)
aggregate(all_tripsv2$ride_length ~ all_tripsv2$member_casual, FUN = min)
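The same four statistics can also be produced in one table with dplyr's `group_by()` and `summarise()`. The sketch below runs on a toy table (`toy_trips`, with invented ride lengths in seconds) rather than on `all_tripsv2`, so the numbers it prints are not results from the real data.

```r
library(dplyr)

# Toy data standing in for all_tripsv2 (ride_length in seconds)
toy_trips <- tibble(
  member_casual = c("casual", "casual", "member", "member"),
  ride_length   = c(600, 1800, 300, 900)
)

# One row per rider type, one column per statistic
ride_summary <- toy_trips %>%
  group_by(member_casual) %>%
  summarise(
    mean_length   = mean(ride_length),
    median_length = median(ride_length),
    max_length    = max(ride_length),
    min_length    = min(ride_length)
  )
ride_summary
```

A single summary table makes the rider types easier to compare side by side than four separate `aggregate()` outputs.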

Generally, the average trip duration for casual riders (1953.644 seconds) is higher than that of members who have a subscription (837.854 seconds).

We can further analyze the data by finding out the average ride length for each type of rider (casual or member) by the day of the week.

In [None]:
#Order the days of the week starting with Sunday
all_tripsv2$day_of_week <- ordered(all_tripsv2$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

#average ride time by each day for members vs casual
aggregate(all_tripsv2$ride_length ~ all_tripsv2$member_casual + all_tripsv2$day_of_week, FUN = mean)


The day with the highest average trip duration for casual riders was Thursday (3142.16 seconds), while the day with the highest average trip duration for members was Sunday (1139.29 seconds).

We can also summarize the ridership data by type of rider and weekday while calculating the number of rides and the average duration.

In [None]:
all_tripsv2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  #creates weekday field using wday()
  group_by(member_casual, weekday) %>%  #groups by usertype and weekday
  summarise(number_of_rides = n()  #calculates the number of rides  
  ,average_duration = mean(ride_length)) %>% # calculates the average duration
  arrange(member_casual, weekday)  # sorts

The above data can be visualized using comparative bar graphs where we plot the weekday against the number of rides.

In [None]:
all_tripsv2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge")

From the above bar graph, we can conclude that rides for casual riders are highest over the weekends while rides for subscription members are highest during working days.

We can also create comparative bar graphs for the average duration of rides.

In [None]:
all_tripsv2 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge")

Generally, the trip duration for casual riders is higher than the trip duration for subscription members, so the company should focus more on casual riders.