# **Scenario**

Cyclistic is a bike-share program with 5,800 bicycles and 600 docking stations. Cyclistic users are more likely to ride for leisure, but about 30% use the bikes to commute to work each day. The bikes can be unlocked from one station and returned to any other station in the system at any time.

Until now, Cyclistic has offered three pricing plans:
- single-ride passes for casual riders, 
- full-day passes for casual riders, 
- annual memberships for Cyclistic members. 

Annual members are much more profitable than casual riders.

The director of marketing, Lily Moreno, believes the company’s future success depends on *maximizing the number of annual memberships by converting casual riders into members*.
Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

The Cyclistic marketing analytics team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.


### **Ask**
1. How do annual members and casual riders use Cyclistic bikes differently?
2. Why would casual riders buy Cyclistic annual memberships?
3. How can Cyclistic use digital media to influence casual riders to become members?

**Business task**

Based on the differences between these two kinds of users, what can stimulate the conversion of casual riders into members?

**Stakeholders**

Lily Moreno, the director of marketing, the rest of the marketing analytics team, and the Cyclistic executive team.

### **Prepare**

The data has been made available by Motivate International Inc. under this [license](https://ride.divvybikes.com/data-license-agreement).

This analysis uses the data for 2022.






In [None]:
# adding libraries

library(tidyverse)
library(lubridate)
library(ggplot2)

In [None]:
#loading data

Jan <- read.csv("../input/divvytripdata-2022-dataset/202201-divvy-tripdata/202201-divvy-tripdata.csv")
Feb <- read.csv("../input/divvytripdata-2022-dataset/202202-divvy-tripdata/202202-divvy-tripdata.csv")
Mar <- read.csv("../input/divvytripdata-2022-dataset/202203-divvy-tripdata/202203-divvy-tripdata.csv")
Apr <- read.csv("../input/divvytripdata-2022-dataset/202204-divvy-tripdata/202204-divvy-tripdata.csv")
May <- read.csv("../input/divvytripdata-2022-dataset/202205-divvy-tripdata/202205-divvy-tripdata.csv")
Jun <- read.csv("../input/divvytripdata-2022-dataset/202206-divvy-tripdata/202206-divvy-tripdata.csv")
Jul <- read.csv("../input/divvytripdata-2022-dataset/202207-divvy-tripdata/202207-divvy-tripdata.csv")
Aug <- read.csv("../input/divvytripdata-2022-dataset/202208-divvy-tripdata/202208-divvy-tripdata.csv")
Sep <- read.csv("../input/divvytripdata-2022-dataset/202209-divvy-tripdata/202209-divvy-publictripdata.csv")
Oct <- read.csv("../input/divvytripdata-2022-dataset/202210-divvy-tripdata/202210-divvy-tripdata.csv")
Nov <- read.csv("../input/divvytripdata-2022-dataset/202211-divvy-tripdata/202211-divvy-tripdata.csv")
Dec <- read.csv("../input/divvytripdata-2022-dataset/202212-divvy-tripdata/202212-divvy-tripdata.csv")

In [None]:
#check the structure of the tables
colnames(Jan)
colnames(Feb)
colnames(Mar)
colnames(Apr)
colnames(May)
colnames(Jun)
colnames(Jul)
colnames(Aug)
colnames(Sep)
colnames(Oct)
colnames(Nov)
colnames(Dec)

#they look the same
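
The visual comparison of the twelve `colnames()` outputs can also be automated. The following is a minimal sketch in base R on hypothetical toy tables; `jan_toy`, `feb_toy`, `mar_toy` and the helper `same_columns` are illustrative names that are not part of the original notebook. It flags every table whose column names differ from those of the first table.

```r
# Toy stand-ins for the monthly tables (hypothetical data, not the Divvy files)
jan_toy <- data.frame(ride_id = "a1", started_at = "2022-01-01 10:00:00", member_casual = "member")
feb_toy <- data.frame(ride_id = "b1", started_at = "2022-02-01 10:00:00", member_casual = "casual")
mar_toy <- data.frame(ride_id = "c1", start_time = "2022-03-01 10:00:00", member_casual = "casual")

# Return TRUE for each table whose column names match the first table exactly
same_columns <- function(tables) {
  reference <- colnames(tables[[1]])
  vapply(tables, function(tbl) identical(colnames(tbl), reference), logical(1))
}

check <- same_columns(list(Jan = jan_toy, Feb = feb_toy, Mar = mar_toy))
print(check)
```

With the real data, the same helper could be called on a named list of the twelve monthly tables; any `FALSE` in the result would point to a table that needs renaming before the merge.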

In [None]:
#merging the datasets

all_trips <- bind_rows(Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sep, Oct, Nov, Dec)
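
The loading and merging steps can be written more compactly by listing the CSV files and reading them in one pass. The sketch below is only an illustration under stated assumptions: it writes two small toy files to a temporary folder (`toy_dir`, the file names and `toy_trips` are hypothetical) instead of touching the real input folder. It uses base `rbind`, which, unlike `bind_rows`, requires every table to have the same column names.

```r
# Create two small CSV files in a temporary folder (hypothetical toy data)
toy_dir <- file.path(tempdir(), "toy_tripdata")
dir.create(toy_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(toy_dir, "202201-toy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "b1", member_casual = "casual"),
          file.path(toy_dir, "202202-toy-tripdata.csv"), row.names = FALSE)

# Find every CSV file below the folder, read each one, and stack the rows
csv_files <- list.files(toy_dir, pattern = "\\.csv$", recursive = TRUE, full.names = TRUE)
monthly_tables <- lapply(csv_files, read.csv)
toy_trips <- do.call(rbind, monthly_tables)

nrow(toy_trips)
```

Because `recursive = TRUE` also searches subfolders, the same pattern would fit an input folder where each month sits in its own directory.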

### **Process**

The newly created dataset needs inspection.

In [None]:
#inspecting the new dataset

glimpse(all_trips)
summary(all_trips)

In [None]:
#check that column "member_casual" contains only the values "member" and "casual"
unique(all_trips[c("member_casual")])

**Cleaning the new dataset**

The new dataset contains NA values, and the next step is to remove them.

In [None]:
#cleaning NA 
all_trips_clean <- drop_na(all_trips)
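
One caveat is worth checking here: `read.csv` reads blank fields in text columns as empty strings rather than `NA`, so a row-wise NA filter does not remove them. The sketch below, on a hypothetical toy table (`toy` and its values are made up for illustration), counts both kinds of missing values per column and shows one way to treat blanks as missing before filtering. It uses base R only.

```r
# Toy trips (hypothetical): one blank station name and one missing coordinate
toy <- data.frame(
  ride_id            = c("a1", "a2", "a3"),
  start_station_name = c("Clark St", "", "State St"),
  end_lat            = c(41.9, 41.8, NA),
  stringsAsFactors   = FALSE
)

# Count NA values and empty strings per column
na_counts    <- colSums(is.na(toy))
blank_counts <- vapply(toy, function(col) sum(col == "", na.rm = TRUE), integer(1))

# Turn blanks in character columns into NA, then keep only complete rows
char_cols <- vapply(toy, is.character, logical(1))
toy[char_cols] <- lapply(toy[char_cols], function(col) replace(col, which(col == ""), NA))
toy_clean <- toy[complete.cases(toy), ]

na_counts
blank_counts
toy_clean
```

Whether blank station names should be dropped depends on the question being asked; for a comparison of ride durations they may be harmless.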

In [None]:
#check the result
summary(all_trips_clean)
head(all_trips_clean)
tail(all_trips_clean)

**Add data and prepare for analysis**

Add columns that list the date, month, day, year, and day of the week of each ride, plus the ride duration in seconds as ride_length.

In [None]:
all_trips_clean$date <- as.Date(all_trips_clean$started_at) #The default format is yyyy-mm-dd
all_trips_clean$month <- format(as.Date(all_trips_clean$date), "%B")#extract the month as a string name
all_trips_clean$day <- format(as.Date(all_trips_clean$date), "%d")
all_trips_clean$year <- format(as.Date(all_trips_clean$date), "%Y")
all_trips_clean$day_of_week <- format(as.Date(all_trips_clean$date), "%A")

all_trips_clean$ride_length <- difftime(all_trips_clean$ended_at, all_trips_clean$started_at, units = "secs") #ride duration in seconds

#convert ride_length from difftime to numeric

all_trips_clean$ride_length <- as.numeric(all_trips_clean$ride_length)
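
The unit of a `difftime` result deserves attention: unless a unit is requested, `difftime` chooses one from the data, so the numeric values are seconds only if the shortest difference is under a minute. The toy example below (hypothetical timestamps) shows the effect and the explicit alternative.

```r
# Two toy rides (hypothetical timestamps), both longer than one minute
started <- as.POSIXct(c("2022-01-01 10:00:00", "2022-01-01 11:00:00"), tz = "UTC")
ended   <- as.POSIXct(c("2022-01-01 10:05:00", "2022-01-01 11:30:00"), tz = "UTC")

# With the default units = "auto", R picks minutes for these data
auto_length <- difftime(ended, started)
units(auto_length)

# Requesting seconds explicitly makes the result independent of the data
ride_length_secs <- as.numeric(difftime(ended, started, units = "secs"))
ride_length_secs
```

Stating the unit explicitly keeps the later conversions to minutes and hours valid regardless of which rides are in the table.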

In [None]:
# Inspect the structure of the columns
str(all_trips_clean)

Remove zero and negative values.
The dataframe includes entries where ride_length is negative or equal to 0.

In [None]:
negative_ride_length <- all_trips_clean %>%
    filter(ride_length <= 0)

In [None]:
glimpse(negative_ride_length)

In [None]:
all_trips_clean2 <- all_trips_clean[!(all_trips_clean$ride_length<=0),]
glimpse(all_trips_clean2)

Inspecting "rideable_type"

In [None]:
unique(all_trips_clean2[c("rideable_type")])

There are 3 types of bikes: electric, classic and docked.

In [None]:
all_trips_clean2 %>% 
    group_by(member_casual, rideable_type) %>% 
    summarize(average_trip_duration = mean(ride_length))

The cleaned dataset for analysis, all_trips_clean2, contains the trip details for 2 types of riders: casual and member. Casual riders use classic, docked and electric bikes, while members use only classic and electric bikes.
The average trip duration for docked bikes is much higher than for the other types; this is inspected later.

### **Analyze**

First, take a look at the duration of trips.

In [None]:
all_trips_clean2 %>% 
    group_by(member_casual, rideable_type) %>% 
        summarize(number_of_rides = n(),
                  average_duration = mean(ride_length),
                 max_duration = max(ride_length))

The longest ride by a casual rider on a docked bike looks too long, almost 24 days.

Let's check whether many users took bikes for such a long time or whether the data is inaccurate. First, let's plot all the rides for casual users.

In [None]:
all_trips_clean2 %>%
    filter(member_casual == "casual")%>%
    ggplot() +
     geom_point(aes(x = date, y = ride_length, colour = rideable_type))

A few docked bikes were taken for unexpectedly long rides.

In [None]:
casual_docked_bike <- all_trips_clean2 %>%
    filter(member_casual == "casual", rideable_type == "docked_bike") %>%
    select(ride_length)

k <- unlist(casual_docked_bike, use.names = FALSE) #vector of ride durations for casual users with docked bikes

quantile(k, probs = c(.25, .5, .75, .90, .95, .98, .999, 1)) 

This result means that 0.1% of the rides made by casual users on docked bikes were longer than 78362 seconds (about 21.8 hours).

Among the other bike types, the longest ride was taken by a casual user on a classic bike and lasted 93581 seconds (about 26 hours).

So for the following analysis, trips longer than 93581 seconds are filtered out:
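
The conversions from seconds to hours quoted above can be checked with a quick calculation. The variable and function names below are illustrative only.

```r
# Thresholds quoted above, in seconds
p999_docked_secs <- 78362  # 99.9th percentile for casual riders on docked bikes
cutoff_secs      <- 93581  # cutoff used to filter out very long trips

# Convert seconds to hours (3600 seconds per hour)
secs_to_hours <- function(secs) secs / 3600

p999_hours   <- round(secs_to_hours(p999_docked_secs), 1)
cutoff_hours <- round(secs_to_hours(cutoff_secs), 1)

p999_hours
cutoff_hours
```

The first value rounds to 21.8 hours and the second to 26 hours, so the cutoff removes only trips that lasted longer than roughly one day.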


In [None]:
all_trips_clean3 <- all_trips_clean2[!(all_trips_clean2$ride_length > 93581),]

In [None]:
all_trips_clean3 %>%
    filter(member_casual == "casual")%>%
    ggplot() +
     geom_point(aes(x = date, y = ride_length, colour = rideable_type))

In [None]:
rides_by_type <- all_trips_clean3 %>% 
    group_by(member_casual, rideable_type) %>% 
        summarize(number_of_rides = n())
rides_by_type

In [None]:
rides_by_type %>%
    filter(member_casual == "casual") %>%
    mutate(percent = number_of_rides/sum(number_of_rides)*100)

7.5% of casual rides were on docked bikes, 38% on classic bikes and 54% on electric bikes.

**Now let's get back to the analysis.**

To find the differences between the two types of riders, the mean, median, longest and shortest ride durations were calculated, first for each bike type and then for each day of the week.

In [None]:
all_trips_clean3 %>% 
    group_by(member_casual, rideable_type) %>% 
        summarize(mean = mean(ride_length),
            median = median(ride_length),
            longest = max(ride_length),
            shortest = min(ride_length))

In [None]:
all_trips_clean3$day_of_week <- ordered(all_trips_clean3$day_of_week, levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))

all_trips_clean3 %>% 
    group_by(member_casual, day_of_week) %>% 
        summarize(mean = mean(ride_length),
            median = median(ride_length),
            longest = max(ride_length),
            shortest = min(ride_length))

Count of rides and average ride duration for each rider type, grouped by day of the week.

In [None]:
all_trips_clean3 %>%
    mutate(weekday = wday(started_at, label = TRUE, abbr = FALSE)) %>% 
    group_by(member_casual, weekday) %>%  
    summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
    arrange(member_casual, weekday)

A closer look at the rides in different months

In [None]:
all_trips_clean3$month <- ordered(all_trips_clean3$month, levels=c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))

sum_months <- all_trips_clean3%>% 
filter(rideable_type=="classic_bike" | rideable_type=="electric_bike")%>%
    group_by(member_casual, month) %>% 
    summarize(number_of_rides = n(),average_duration = mean(ride_length),
             max_ride_length = max(ride_length)/60)

sum_months



### **Share**

**Data visualizations**

In [None]:
all_trips_clean3 %>%
 mutate(weekday = wday(started_at, label = TRUE, abbr = FALSE)) %>%
 group_by(member_casual, weekday) %>%
 summarise(number_of_rides = n()
           ,average_duration = mean(ride_length)) %>%
 arrange(member_casual, weekday)  %>%
 ggplot(aes(x = weekday, y = number_of_rides, fill = member_casual)) +
 geom_col(position = "dodge") + xlab("Weekday") + ylab("Number of rides")

"Casual" users take bikes more often on weekends, while "members" use bikes more often on working days.

In [None]:
all_trips_clean3 %>%
 mutate(weekday = wday(started_at, label = TRUE, abbr = FALSE))  %>%
 group_by(member_casual, weekday) %>%
 summarise(number_of_rides = n()
           ,average_duration = mean(ride_length)) %>%
 arrange(member_casual, weekday)  %>%
 ggplot(aes(x = weekday, y = (average_duration/60), fill = member_casual)) +
 geom_col(position = "dodge") + xlab("Weekday") + ylab("Average duration, min")

"Casual" users take bikes for longer rides (in terms of time) than annual members on every day of the week.

In [None]:
#dataset where rides are summarised by rideable_type, weekday, rider type and duration

sum_ride_type <- all_trips_clean3 %>% 
mutate(weekday = wday(started_at, label = TRUE))  %>%
#filter(rideable_type=="classic_bike" | rideable_type=="electric_bike" | rideable_type=="docked_bike")%>%
    group_by(member_casual, rideable_type, weekday) %>% 
        summarize(number_of_rides = n(),average_duration = mean(ride_length))


In [None]:
sum_ride_type

In [None]:
ggplot(data = sum_ride_type) +
  geom_col(aes(x = weekday, y = average_duration/60, fill = rideable_type), position = "dodge") +
  facet_wrap(~member_casual) + xlab("Weekday") + ylab("Average duration, min")


In [None]:
sum_ride_type %>%
#arrange(rideable_type, weekday)  %>%
ggplot() + geom_col(aes(x = weekday, y = average_duration/60, fill = rideable_type)) +
  facet_wrap(~member_casual) + xlab("Weekday") + ylab("Average duration, min")


In [None]:
ggplot(data = sum_ride_type) +
  geom_col(aes(x = weekday, y = number_of_rides, fill = rideable_type)) +
  facet_wrap(~member_casual) + xlab("Weekday") + ylab("Number of rides")

ggplot(data = sum_ride_type) +
  geom_col(aes(x = weekday, y = number_of_rides, fill = rideable_type), position = "dodge") +
  facet_wrap(~member_casual) + xlab("Weekday") + ylab("Number of rides")

Classic bikes and electric bikes are the most popular among users.

"Members" take classic and electric bikes in almost equal numbers, whereas electric bikes are more popular among "casual" users.

Meanwhile, the average ride duration is longer for "casual" users.

In [None]:
sum_months %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual)) +
 geom_col(position = "dodge")

sum_months %>% 
  ggplot(aes(x = month, y = average_duration/60, fill = member_casual)) +
 geom_col(position = "dodge") + ylab("Average duration, min")

"Member" users ride more often than "casual" users throughout the year. "Casual" users prefer to ride during the summer months, and their number of rides drops closer to winter.

It's interesting that the average duration of "member" rides is more or less the same throughout the year. For "casual" users, however, it is quite high in January and February; I believe this can be explained by fewer but longer rides. A closer look at the January data would help to check this.
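
One way to test the "fewer but longer rides" explanation is to compare the mean and the median ride length per month: a mean far above the median suggests that a few very long rides dominate the average. The sketch below runs on hypothetical toy durations (`toy_rides` is made up for illustration, not taken from the dataset); the same comparison could be applied to the casual rides in all_trips_clean3.

```r
# Toy casual rides (hypothetical durations in seconds, not the real data)
toy_rides <- data.frame(
  month       = c("January", "January", "January", "July", "July", "July"),
  ride_length = c(300, 420, 30000, 900, 1000, 1100),
  stringsAsFactors = FALSE
)

# Mean and median ride length per month
mean_by_month   <- tapply(toy_rides$ride_length, toy_rides$month, mean)
median_by_month <- tapply(toy_rides$ride_length, toy_rides$month, median)

monthly_stats <- data.frame(
  month  = names(mean_by_month),
  mean   = as.numeric(mean_by_month),
  median = as.numeric(median_by_month)
)
monthly_stats
```

In this toy table the January mean is pulled far above the median by a single long ride, while the July mean and median coincide. If the real January data showed the same gap, it would support the explanation above.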

# **Conclusion and Act**


Annual members use Cyclistic classic bikes for shorter rides more often on workdays.

Casual riders use Cyclistic electric bikes mostly on weekends for longer rides.
The difference between the number of rides in the summer and winter months is bigger for casual riders.

This may indicate that casual riders use bikes for entertainment, especially during the summer months.

Casual riders might buy Cyclistic annual memberships if the company offered special conditions for weekends or the summer season.

Cyclistic can use digital media to influence casual riders to become members by creating gamification for members.
