# Cyclistic Bike-Share Analysis

In [1]:
# I originally did this project on Kaggle.
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Background
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

# Business Task
The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, my team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, the team will design a new marketing strategy to convert casual riders into annual members and present it to the Cyclistic executives for approval.

## Datasets
I used Cyclistic’s historical trip data to analyze and identify trends. The last 12 months (April 2021 - March 2022) of Cyclistic trip data were downloaded from [here](https://divvy-tripdata.s3.amazonaws.com/index.html). (Note: The datasets have a different name because Cyclistic is a fictional company. For the purposes of this case study, the datasets are appropriate and will enable you to answer the business questions.) The data has been made available by Motivate International Inc. under this [license](https://www.divvybikes.com/data-license-agreement). It is public data that you can use to explore how different customer types are using Cyclistic bikes.



## Load Packages

In [2]:
library(plyr)        # load plyr first so it does not mask dplyr verbs such as summarise()
library(tidyverse)
library(dplyr)
library(readr)
library(purrr)
library(ggplot2)
library(lubridate)

## Data Collection

The 12 extracted historical data files are in CSV format. 

Upload Cyclistic datasets (csv files).

In [3]:
apr21 <- read_csv("../input/202103-202203-cyclistic-dataset/202104-divvy-tripdata.csv")
may21 <- read_csv("../input/202103-202203-cyclistic-dataset/202105-divvy-tripdata.csv")
jun21 <- read_csv("../input/202103-202203-cyclistic-dataset/202106-divvy-tripdata.csv")
jul21 <- read_csv("../input/202103-202203-cyclistic-dataset/202107-divvy-tripdata.csv")
aug21 <- read_csv("../input/202103-202203-cyclistic-dataset/202108-divvy-tripdata.csv")
sep21 <- read_csv("../input/202103-202203-cyclistic-dataset/202109-divvy-tripdata.csv")
oct21 <- read_csv("../input/202103-202203-cyclistic-dataset/202110-divvy-tripdata.csv")
nov21 <- read_csv("../input/202103-202203-cyclistic-dataset/202111-divvy-tripdata.csv")
dec21 <- read_csv("../input/202103-202203-cyclistic-dataset/202112-divvy-tripdata.csv")
jan22 <- read_csv("../input/202103-202203-cyclistic-dataset/202201-divvy-tripdata.csv")
feb22 <- read_csv("../input/202103-202203-cyclistic-dataset/202202-divvy-tripdata.csv")
mar22 <- read_csv("../input/202103-202203-cyclistic-dataset/202203-divvy-tripdata.csv")

### Wrangle and Merge Data into a single file

Inspect the dataframes and look for incongruencies.


In [None]:
str(apr21)
str(may21)
str(jun21)
str(jul21)
str(aug21)
str(sep21)
str(oct21)
str(nov21)
str(dec21)
str(jan22)
str(feb22)
str(mar22)

There are no incongruencies: all columns have matching names and the right data types. So I then merged all 12 CSV files into a single dataset, `trip_data`.
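The same comparison can be done programmatically instead of reading each `str()` output by eye. The sketch below is only an illustration on two toy data frames; the list `monthly` and the helper `col_signature` are hypothetical names that do not appear in the original notebook.

```r
# Toy stand-ins for the monthly data frames (hypothetical values)
monthly <- list(
  apr = data.frame(ride_id = "A1",
                   started_at = as.POSIXct("2021-04-01 08:00:00", tz = "UTC")),
  may = data.frame(ride_id = "B1",
                   started_at = as.POSIXct("2021-05-01 09:30:00", tz = "UTC"))
)

# Collapse the column names and first class of each column into one string
col_signature <- function(df) {
  classes <- vapply(df, function(col) class(col)[1], character(1))
  paste(names(df), classes, sep = ":", collapse = ", ")
}

signatures <- vapply(monthly, col_signature, character(1))

# TRUE when every data frame has the same column names and types
all_match <- length(unique(signatures)) == 1
all_match
```

If `all_match` were `FALSE`, printing `signatures` would show which file deviates.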

In [8]:
trip_data <- list.files(path = "/Users/iClin/Downloads/data",  
                        pattern = "\\.csv$", full.names = TRUE) %>%        # Identify all CSV files (pattern is a regular expression)
  lapply(read_csv) %>%                               # Read each file into a list element
  bind_rows()                                        # Combine data sets into one data set

In [9]:
trip_data <- as.data.frame(trip_data)     # Convert tibble to data.frame (assigned so the conversion is kept)

### Data Cleaning and Manipulation

Inspect the new merged dataset.

In [10]:
colnames(trip_data)  # List of column names

In [11]:
nrow(trip_data)  # How many rows are in the data frame?

In [12]:
dim(trip_data)  # Dimensions of the data frame?


In [13]:
head(trip_data)  # See the first 6 rows of data frame.  


In [14]:
tail(trip_data)   # See the last 6 rows of data frame.  

In [15]:
str(trip_data)  # See list of columns and data types (numeric, character, double, etc)


In [16]:
summary(trip_data)  # Statistical summary of data. Mainly for numerics.

Check for unique user types in the `member_casual` column.

In [17]:
unique(trip_data$member_casual)

I added columns that list the `date`, `month`, `day`, `year`, and `day_of_week` of each ride.
This makes it possible to aggregate ride data by month, day, year, or day of the week. Without these columns, the data could only be aggregated at the ride level.

In [18]:
trip_data$date <- as.Date(trip_data$started_at)         #The default format is yyyy-mm-dd
trip_data$month <- format(as.Date(trip_data$date), "%m")
trip_data$day <- format(as.Date(trip_data$date), "%d")
trip_data$year <- format(as.Date(trip_data$date), "%Y")
trip_data$day_of_week <- format(as.Date(trip_data$date), "%A")
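To make the format codes concrete, here is a small sketch on a single made-up timestamp (the value and the variable names are assumptions for illustration, not data from the trip files).

```r
# One hypothetical ride start time
started_at <- as.POSIXct("2021-04-03 14:25:00", tz = "UTC")
ride_date  <- as.Date(started_at)

ride_month <- format(ride_date, "%m")  # two-digit month
ride_day   <- format(ride_date, "%d")  # two-digit day of the month
ride_year  <- format(ride_date, "%Y")  # four-digit year
ride_wday  <- format(ride_date, "%A")  # full weekday name, depends on the session locale

c(month = ride_month, day = ride_day, year = ride_year, weekday = ride_wday)
```

Note that all four results are character strings, so they sort alphabetically unless converted to numbers or ordered factors.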

In [19]:
colnames(trip_data)  # List of column names

The duration of each ride (in seconds) was then calculated and placed in a new column, `ride_length`.


In [20]:
trip_data <- mutate(trip_data, ride_length = difftime(ended_at, started_at, units = "secs"))  # fix the units so the result is always in seconds

Inspect the structure of the columns.


In [21]:
str(trip_data)

Convert `ride_length` from difftime to numeric in order to be able to run calculations on the data.


In [22]:
trip_data$ride_length <- as.numeric(trip_data$ride_length, units = "secs")  # numeric seconds
is.numeric(trip_data$ride_length)
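As a quick sanity check of the duration logic, the following sketch computes the length of one made-up ride. The timestamps are hypothetical; `difftime()` picks its units automatically unless `units` is given, which is why the units are stated explicitly here.

```r
# One hypothetical ride: 25 minutes and 30 seconds long
started_at <- as.POSIXct("2021-04-03 14:00:00", tz = "UTC")
ended_at   <- as.POSIXct("2021-04-03 14:25:30", tz = "UTC")

ride_length  <- difftime(ended_at, started_at, units = "secs")  # difftime object
ride_seconds <- as.numeric(ride_length, units = "secs")         # plain number of seconds
ride_minutes <- ride_seconds / 60

ride_seconds
ride_minutes
```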

The dataframe includes a few hundred entries where bikes were taken out of docks and checked for quality by Cyclistic (`start_station_name` "*HQ QR*") or where `ride_length` was negative. I removed this bad data by creating a new version of the dataset.

In [23]:
# %in% returns FALSE for a missing station name, so NA names do not turn into all-NA rows
trip_data_v2 <- trip_data[!(trip_data$start_station_name %in% "HQ QR" | trip_data$ride_length < 0), ]
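One caveat with this kind of logical subsetting: when the condition evaluates to `NA` (for example because a station name is missing), base R returns a row filled with `NA` instead of dropping or keeping the original row. The toy sketch below, with made-up station names and durations, contrasts `==` with `%in%`, which treats a missing value as "no match".

```r
# Toy data: one HQ QR ride, one ride with a missing station, one negative duration
rides <- data.frame(
  start_station_name = c("HQ QR", NA, "Clark St", "State St"),
  ride_length = c(120, 300, -5, 600),
  stringsAsFactors = FALSE
)

# With ==, the NA station gives an NA condition and therefore an all-NA row
naive <- rides[!(rides$start_station_name == "HQ QR" | rides$ride_length < 0), ]

# With %in%, the NA station simply does not match and the row is kept unchanged
safe <- rides[!(rides$start_station_name %in% "HQ QR" | rides$ride_length < 0), ]

naive
safe
```

Both results have two rows, but only `safe` preserves the real 300-second ride whose start station is missing.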

In [24]:
unique(trip_data_v2$member_casual)

Ensure there are no null values in `member_casual`.

In [25]:
trip_data_v2 <- trip_data_v2[!(is.na(trip_data_v2$member_casual)), ]

##### Dataset Limitation
The main limitation of the dataset is that it has a lot of missing values. Such a large amount of missing data can affect the analysis and lead to inaccurate conclusions.

Check for null values in all columns.

In [26]:
na_count <- data.frame(map_dbl(trip_data_v2, ~sum(is.na(.))))    # Missing values per column
na_count
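For reference, the same per-column count can be obtained in base R with `colSums(is.na(...))`. This is a sketch on a toy data frame with hypothetical values, not the author's method.

```r
# Toy data frame with two missing end stations
rides <- data.frame(
  ride_id = c("a", "b", "c"),
  end_station_name = c("State St", NA, NA),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; colSums() counts the TRUE values per column
na_count <- colSums(is.na(rides))
na_count
```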

Because many values are missing for `end_station_name` and `end_station_id`, I dropped the rows with a missing or empty `start_station_name`, `end_station_name`, or `ride_id`; this will not affect the business task.

In [27]:
trip_data_v3 <- trip_data_v2[!(is.na(trip_data_v2$end_station_name) | trip_data_v2$end_station_name=="" | 
                              is.na(trip_data_v2$start_station_name) | trip_data_v2$start_station_name==""|
                              is.na(trip_data_v2$ride_id) | trip_data_v2$ride_id==""),] 

In [28]:
unique(trip_data_v3$member_casual)
dim(trip_data_v3)

Check for duplicate rides.

In [29]:
trip_data_v3[duplicated(trip_data_v3$ride_id),]

I found that there are duplicate `ride_id`s, so I removed the duplicate rides.

In [30]:
trip_data_v4 <- trip_data_v3[!duplicated(trip_data_v3$ride_id),]
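For readers unfamiliar with `duplicated()`: it flags every occurrence of a value after the first one, so negating it keeps exactly one row per `ride_id`. A minimal sketch with made-up IDs:

```r
# Hypothetical ride IDs with two repeats
ride_id <- c("r1", "r2", "r1", "r3", "r2")

is_dup <- duplicated(ride_id)   # TRUE only for the second and later occurrences
unique_rides <- ride_id[!is_dup]

is_dup
unique_rides
```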

In [31]:
trip_data_v4 <- as.data.frame(trip_data_v4)     # Convert tibble to data.frame (assigned so the conversion is kept)

### Descriptive Analysis

Descriptive analysis on `ride_length` (all figures in seconds).

On average, a casual ride lasted for **1922.9215** seconds (**32.0487** minutes) while an annual member ride lasted **776.4847** seconds (**12.9414** minutes).
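The conversion from seconds to minutes, and the gap between the two groups, can be checked with a few lines of arithmetic. The two averages are the figures reported above; the variable names are only for illustration.

```r
# Average ride lengths in seconds, as reported above
casual_mean_secs <- 1922.9215
member_mean_secs <- 776.4847

casual_mean_mins <- casual_mean_secs / 60   # about 32.05 minutes
member_mean_mins <- member_mean_secs / 60   # about 12.94 minutes

# Difference between casual and member averages
diff_secs <- casual_mean_secs - member_mean_secs   # about 1146.44 seconds
diff_mins <- diff_secs / 60                        # about 19.11 minutes

round(c(casual_mean_mins, member_mean_mins, diff_secs, diff_mins), 2)
```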

In [32]:
mean(trip_data_v4$ride_length) # straight average (total ride length / rides)
median(trip_data_v4$ride_length) # midpoint number in the ascending array of ride lengths
max(trip_data_v4$ride_length) # longest ride
min(trip_data_v4$ride_length) # shortest ride

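# Most frequent value; named stat_mode so it does not mask base::mode()
stat_mode <- function(x) { ux <- unique(x); ux[which.max(tabulate(match(x, ux)))] }
stat_mode(trip_data_v4$ride_length)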

In [33]:
summary(trip_data_v4$ride_length)

#### Compare members and casual users.

In [34]:
aggregate(data = trip_data_v4, ride_length ~ member_casual, FUN = mean)
aggregate(data = trip_data_v4, ride_length ~ member_casual, FUN = median)
aggregate(data = trip_data_v4, ride_length ~ member_casual, FUN = max)
aggregate(data = trip_data_v4, ride_length ~ member_casual, FUN = min)

Calculate the **average ride duration** for each day of the week for members vs casual users.


In [35]:
aggregate(data = trip_data_v4, ride_length ~ member_casual + day_of_week, FUN = mean)

In the above output, I noticed that the days of the week are out of order. So, I fixed that as follows.

In [36]:
trip_data_v4$day_of_week <- ordered(trip_data_v4$day_of_week, 
                                     levels=c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
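The reason the days appear out of order is that plain character values sort alphabetically, while an ordered factor sorts by its declared levels. A small sketch with three example day names:

```r
days <- c("Monday", "Sunday", "Saturday")

# Character vectors sort alphabetically
alphabetical <- sort(days)

# An ordered factor sorts by the order of its levels instead
ordered_days <- ordered(days,
                        levels = c("Sunday", "Monday", "Tuesday", "Wednesday",
                                   "Thursday", "Friday", "Saturday"))
by_weekday <- as.character(sort(ordered_days))

alphabetical
by_weekday
```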

Now re-run the aggregation.

In [37]:
aggregate(data = trip_data_v4, ride_length ~ member_casual + day_of_week, FUN = mean)

#### Analyzing ridership data by type and weekday.

In [38]:
trip_data_v4 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>%  # creates weekday field using wday()
  group_by(member_casual, weekday) %>%                  # groups by usertype and weekday
  dplyr::summarise(number_of_rides = n(),                      # calculates the number of rides and average duration             
    average_duration = mean(ride_length)) %>%           # calculates the average duration
  arrange(member_casual, weekday)

#### Number of rides by user type.

The annual members took **552,691** more rides than the casual users.

In [39]:
trip_data_v4 %>% 
    group_by(member_casual) %>% 
    dplyr::summarise(number_of_rides = n()) %>% 
    arrange(member_casual)  %>% 
    ggplot(aes(x = member_casual, y = number_of_rides, fill = member_casual)) +
    geom_col(position = "dodge") +
    geom_text(aes(label = number_of_rides), vjust = 2, color = "white") +   # label each bar with its ride count
    labs(title="Number of rides by user type", 
       subtitle="From April 2021 to March 2022",
       x="User type", y="Number of rides")

#### Visualizing the number of rides by rider type.

* On weekdays, members take more rides than casual users.
* On weekends, casual users take more rides than members.

In [None]:
trip_data_v4 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  dplyr::summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title="Number of rides by rider type", 
       subtitle="From April 2021 to March 2022",
       x="Weekday", y="Number of rides")

#### Average ride duration by user type.

On average, the rides of casual users lasted for **1146.44** seconds (19.11 minutes) more than those of members.


In [40]:
trip_data_v4 %>% 
     group_by(member_casual) %>% 
     dplyr::summarise(average_duration = mean(ride_length)) %>% 
     arrange(member_casual)  %>% 
     ggplot(aes(x = member_casual, y = average_duration, fill = member_casual)) +
     geom_col(position = "dodge") +
     geom_text(aes(label = round(average_duration / 60, 2)), vjust = 2, color = "white") +   # bar labels in minutes
     labs(title="Average ride duration by user type", 
          subtitle="From April 2021 to March 2022",
          x="User type", y="Average ride duration (seconds)")

#### Visualizing the average ride duration by rider type.

This reveals that, although members take more rides on weekdays than casual users, on average casual users ride Cyclistic bikes for longer durations every day of the week.


In [None]:
trip_data_v4 %>% 
  mutate(weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  dplyr::summarise(number_of_rides = n()
            ,average_duration = mean(ride_length)) %>% 
  arrange(member_casual, weekday)  %>% 
  ggplot(aes(x = weekday, y = average_duration, fill = member_casual)) +
  geom_col(position = "dodge") +
  labs(title="Average ride duration by rider type", 
       subtitle="From April 2021 to March 2022",
       x="Weekday", y="Average ride duration")+
  facet_wrap(~member_casual)

#### Number of rides by bike type.

* Both members and casual users ride classic bikes the most.
* Casual users ride more docked bikes than members.
* Both members and casual users ride more electric and classic bikes than docked bikes.
* Both members and casual users ride docked bikes least.

In [41]:
trip_data_v4 %>% 
  group_by(member_casual, rideable_type) %>% 
  dplyr::summarise(number_of_rides = n()) %>%
  ggplot(aes(x = rideable_type, y = number_of_rides, fill = member_casual)) +
  geom_col(position = "stack")+
  labs(title="Number of rides by bike type", 
       subtitle="From April 2021 to March 2022",
       x="Bike type", y="Number of rides") + theme_bw(base_size = 15)

## Key findings
* On average, a casual ride lasted for **1922.9215** seconds (**32.0487** minutes) while an annual member ride lasted **776.4847** seconds (**12.9414** minutes).
* The annual members took **552,691** more rides than the casual users.
* On weekdays, members take more rides than casual users.
* On weekends, casual users take more rides than members.
* On average, the rides of casual users lasted for **1146.44** seconds (19.11 minutes) more than those of members.
* Although members take more rides on weekdays than casual users, on average casual users ride Cyclistic bikes for longer durations every day of the week.
* Both members and casual users ride classic bikes the most.
* Casual users ride more docked bikes than members.
* Both members and casual users ride more electric and classic bikes than docked bikes.
* Both members and casual users ride docked bikes least.

# Recommendations

For a marketing strategy to convert more casual users into annual members, here are some recommendations to present to the Cyclistic executives for approval.
* Create an annual member promotion for **weekends**. Since casual users ride for much longer durations during the weekend, such a promotion will encourage them to subscribe.
* Create annual subscription discount promotions for rides longer than **30 minutes**. Since casual user rides last about 32 minutes on average, this offer will be enticing.
* Create an annual campaign offer for docked bike users. Since a lot more casual users ride docked bikes, they will be more likely to subscribe to an annual membership.


### Export summary file for further analysis

In [None]:
counts <- aggregate(data = trip_data_v4, ride_length ~ member_casual + day_of_week, FUN = mean)
write.csv(counts, file = 'avg_ride_length.csv')
