# **Google Capstone Project** - **Cyclistic Case Study**
### This notebook shares the process used in approaching the **Google Capstone Project**, **Case Study 1: How Does a Bike-Share Navigate Speedy Success?**

# **Background and Setting**
* #### **Cyclistic** is a Chicago-based **bike-share program** with a fleet of **5,824** bicycles that are geo-tracked and locked into a network of **692** stations.
* #### **The bikes** can be unlocked from one station and returned to any other station in the system at any time.
* #### **Cyclistic** sets itself apart by offering not only standard **top seated bikes** but also **reclining bikes**, **hand tricycles**, and **cargo bikes**.
* #### **Cyclistic's** business model provides a **more inclusive** program for people with **disabilities** and riders who are not able to use a **standard two-wheeled bike**.

  #### **Cyclistic** has two classifications for its users:
 1. #### Customers who purchase **single ride** or **full-day passes** are referred to as **Casual Riders**.
 1. #### Customers who purchase **annual memberships** are referred to as **Members**.

## **Stakeholders in Our Scenario**
* #### **Cyclistic:** The company that runs the bike-share program.
* #### **Lily Moreno:** The director of marketing and our manager.
* #### **Cyclistic marketing analytics team:** Within the scope of this case study, this is our team of data analysts.
* #### **Cyclistic executive team:** The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program. 

## **Objectives and Goals**
  #### **Lily Moreno**, the director of marketing, has set a clear goal: design marketing strategies aimed at converting Casual Riders into Annual Members.
* #### **The Goal of the analytics team is to understand:**
1. #### How do Annual Members and Casual Riders use Cyclistic bikes differently?
1. #### Why would Casual Riders buy Cyclistic Annual Memberships?
1. #### How can Cyclistic use digital media to influence casual riders to become members?
#### This **presentation aims to answer** the first question: **How do Annual Members and Casual Riders use Cyclistic bikes differently?**

# **The Data**
* #### Data for this case study has been obtained from the **Divvy Bikes Trip Records** available through **[This Link](https://divvy-tripdata.s3.amazonaws.com/index.html).**
* #### For the **purposes of this case study**, the datasets are appropriate and will enable us to answer the business questions. 
* #### The data has been made available by **Motivate International Inc.** under this **[License](https://ride.divvybikes.com/data-license-agreement)**.
* #### The data used for this analysis covers the trips taken during a **12-month period**, from **January 2021 to December 2021**.
* #### The data includes a record for **each registered trip**, with its exact **starting** and **ending times**, **starting** and **ending stations**, and the **membership type** of the rider **(Casual Rider or Member)**.

## **Addressing Data Limitations**
#### **An important limitation of our data is that, due to privacy concerns, no personally identifiable information can be used.**
* #### This means that we are not able to connect pass purchases to credit card numbers;
* #### As a result, we cannot determine whether casual riders live in the Cyclistic service area or whether they have purchased multiple single passes.

#### **The data spans 5,595,063 rows in total, but it is missing the names and IDs of 690,806 starting stations and 739,170 ending stations.**
* #### The decision was made to exclude these trip entries from the case study analysis. In a real scenario, we would need to find out why these values are blank before deciding how to handle them.
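
Counts like the ones above can be reproduced with a check along the following lines. This is a minimal sketch on a made-up three-row data frame: the `toy_trips` data and the `is_blank()` helper are illustrative names, not part of the Divvy files or of any package. It treats both `NA` and empty strings as missing, since CSV readers differ in how they represent blank cells.

```r
# Toy data standing in for the real trip records (values are made up)
toy_trips <- data.frame(
  ride_id            = c("A1", "B2", "C3"),
  start_station_name = c("Clark St", "", NA),
  end_station_name   = c("State St", "Lake St", ""),
  stringsAsFactors   = FALSE
)

# Hypothetical helper: TRUE where a value is NA or an empty string
is_blank <- function(x) is.na(x) | x == ""

# Count blank entries in each station name column
missing_counts <- sapply(
  toy_trips[c("start_station_name", "end_station_name")],
  function(col) sum(is_blank(col))
)
print(missing_counts)
```

On the real data, the same `sapply()` call could be pointed at the station name and station ID columns of the combined data frame.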

# **Data Handling and Initial Processing**
#### **Data was downloaded from the repository in the form of 12 CSV files, each containing the records of one month of 2021.**
#### **The downloaded data has already been processed at the source to:** 
* #### Remove trips that are taken by staff as they service and inspect the system;  
* #### Remove any trips that were below 60 seconds in length (potentially false starts or users trying to re-dock a bike to ensure it was secure).

#### **More information about the data rules and preprocessing at the source can be found at [this link](https://ride.divvybikes.com/system-data).**
#### **A backup of the raw data was created, and a naming convention and folder structure were set up for the files.**
#### **The data was then inspected through Microsoft Power Query in Excel (a sample of the first 1,000 records was inspected to verify the structure of the data).**

## **Initial Setup**
### **Installing necessary packages**

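#### **Throughout the processing and cleaning steps we will use the following packages:**
 * #### **janitor** - helpers for examining and cleaning data
 * #### **readr** - reading delimited files such as CSV
 * #### **tidyverse** - a collection of data science packages, including dplyr and ggplot2
 * #### **hms** - a class for storing time-of-day values
 * #### **lubridate** - functions for parsing and working with dates and times
 * #### **ggplot2** - data visualization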
 


In [None]:
# The whole list of packages can be installed using the following command:
install.packages(c("janitor", "readr", "tidyverse", "hms", "lubridate", "ggplot2"))

In [None]:
# Load Libraries
library(tidyverse)
library(readr)
library(janitor)
library(lubridate)
library(hms)

### **Importing and Appending Data**
#### **We will begin by importing the data into our workspace** by using **read.csv()** and then appending the files together using **rbind()**
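
The monthly files can also be read in a loop rather than one call per file. The sketch below is only an illustration under stated assumptions: it writes two tiny demo CSV files to a temporary folder (the `divvy_demo` folder, the file names, and their contents are made up) and then reads and stacks every file whose name matches the pattern.

```r
# Create two tiny CSV files in a temporary folder (stand-ins for the monthly files)
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("A1", "B2"), member_casual = c("member", "casual")),
          file.path(data_dir, "202101-demo-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "C3", member_casual = "member"),
          file.path(data_dir, "202102-demo-tripdata.csv"), row.names = FALSE)

# List every matching CSV in the folder, read each one, and stack the rows
csv_files    <- list.files(data_dir, pattern = "-tripdata\\.csv$", full.names = TRUE)
monthly_list <- lapply(csv_files, read.csv)
all_trips    <- do.call(rbind, monthly_list)

nrow(all_trips)
```

This approach only works when every file has the same columns, which `rbind()` also requires in the explicit version.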

In [None]:
# We will import our data from the working directory
## Syntax: variable_name <- read.csv("file path relative to the working directory")
tripdata_01 <- read.csv("../input/divvytrips2021/202101-divvy-tripdata.csv") # January Data
tripdata_02 <- read.csv("../input/divvytrips2021/202102-divvy-tripdata.csv") # February Data
tripdata_03 <- read.csv("../input/divvytrips2021/202103-divvy-tripdata.csv") # March Data
tripdata_04 <- read.csv("../input/divvytrips2021/202104-divvy-tripdata.csv") # April Data
tripdata_05 <- read.csv("../input/divvytrips2021/202105-divvy-tripdata.csv") # May Data
tripdata_06 <- read.csv("../input/divvytrips2021/202106-divvy-tripdata.csv") # June Data
tripdata_07 <- read.csv("../input/divvytrips2021/202107-divvy-tripdata.csv") # July Data
tripdata_08 <- read.csv("../input/divvytrips2021/202108-divvy-tripdata.csv") # August Data
tripdata_09 <- read.csv("../input/divvytrips2021/202109-divvy-tripdata.csv") # September Data
tripdata_10 <- read.csv("../input/divvytrips2021/202110-divvy-tripdata.csv") # October Data
tripdata_11 <- read.csv("../input/divvytrips2021/202111-divvy-tripdata.csv") # November Data
tripdata_12 <- read.csv("../input/divvytrips2021/202112-divvy-tripdata.csv") # December Data

In [None]:
# We will combine our trip records in one file for easier analysis
## We will use rbind() to combine data frames by adding rows together
trips_2021 <- rbind(tripdata_01, tripdata_02, tripdata_03, tripdata_04,
      tripdata_05, tripdata_06, tripdata_07, tripdata_08,
      tripdata_09, tripdata_10, tripdata_11, tripdata_12)

In [None]:
# We can use our standard data frame inspection functions, such as:
## head(), colnames(), glimpse()
### In this example we will use dim() to determine the dimensions of our data frame
dim(trips_2021) # We have a total of 13 Columns and 5,595,063 rows

### **Column Selections and Cleaning**
#### We will then **filter out** records that contain **missing values**, including missing entries for the **starting station** and/or **ending station names**, by using **drop_na()** and **filter()**
#### Then we will select the columns that are relevant to the question we are trying to answer
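
A small sketch of the station filtering on made-up data may help here. Missing station names can show up either as `NA` or as empty strings, depending on how the CSV was read, so the filter below requires a name to be neither. The `toy_trips` values are invented for illustration.

```r
library(dplyr)

# Made-up trips: B2 has an empty start name, C3 has an NA start name,
# and D4 has an empty end name
toy_trips <- tibble(
  ride_id            = c("A1", "B2", "C3", "D4"),
  start_station_name = c("Clark St", "", NA, "Lake St"),
  end_station_name   = c("State St", "Lake St", "Clark St", "")
)

# Keep a row only when BOTH names are present: not NA and not an empty string
complete_trips <- toy_trips %>%
  filter(!is.na(start_station_name) & start_station_name != "",
         !is.na(end_station_name) & end_station_name != "")

complete_trips$ride_id
```

Joining the two tests with `|` instead of `&` would let empty strings through, because an empty string is not `NA`.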

In [None]:
# drop_na() removes every row that contains an NA in any column
# We then filter out the rows where the start station and/or the end station name is missing
# A name counts as missing when it is NA or an empty string, so the two tests are joined with &
# The number of rows after this filtering equates to 4,588,302

trips_2021 <- trips_2021 %>% drop_na() %>% 
  filter(!is.na(end_station_name) & end_station_name != "") %>%
  filter(!is.na(start_station_name) & start_station_name != "")

In [None]:
# Now we will select only the data that is necessary for our analysis
# Columns listed in the selection function are columns to be removed from our data frame
# We are using -c() to select all but columns listed

trips_2021 <- trips_2021 %>% 
  select(-c(start_station_id, end_station_id, start_lat,
            start_lng, end_lat, end_lng, start_station_name, end_station_name))


### **Adding New Columns for Our Data**
#### We will create some new columns to further enhance our analysis results
#### We will add some calculated columns and some filtering columns

In [None]:
# We will use mutate() to add and transform columns
# We will first convert the started_at and ended_at columns to date-time values
# We will add start_weekday to find the weekday on which each trip started
# We will extract the dates on which each trip started and ended
trips_2021 <- trips_2021 %>% 
  mutate(started_at=ymd_hms(started_at), ended_at=ymd_hms(ended_at),
         start_weekday=wday(started_at, label=TRUE, abbr=FALSE),
         start_date=as.Date(started_at), end_date=as.Date(ended_at))

In [None]:
# We will add a column to calculate the duration of each trip in seconds
# We will combine as.numeric() and difftime(), both from base R
# difftime() calculates the difference between two times in a chosen time unit
# as.numeric() removes the unit and converts the result to a plain number to ease calculations
trips_2021$trip_duration_secs <- as.numeric(difftime(trips_2021$ended_at,
                                                     trips_2021$started_at, 
                                                     units="secs"))
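
A quick sanity check on computed durations can be sketched as follows. The source states that trips under 60 seconds were already removed upstream, so a check like this would mainly confirm that claim and catch negative durations. The timestamps and variable names are made up for illustration.

```r
# Two made-up trips; timestamps follow the "YYYY-MM-DD HH:MM:SS" layout
started <- as.POSIXct(c("2021-07-04 10:00:00", "2021-07-04 11:30:00"), tz = "UTC")
ended   <- as.POSIXct(c("2021-07-04 10:12:30", "2021-07-04 11:30:45"), tz = "UTC")

# difftime() is part of base R; as.numeric() drops the time unit
duration_secs <- as.numeric(difftime(ended, started, units = "secs"))

# Flag trips shorter than 60 seconds (this also catches negative durations)
suspicious <- duration_secs < 60
data.frame(duration_secs, suspicious)
```

On the real data, `sum(suspicious)` would give the number of trips worth a closer look before averaging durations.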

In [None]:
# We will create another set of columns for grouping purposes
   
    # The month in which a trip started
trips_2021$start_month <- month(trips_2021$started_at)
    # The hour of the day in which a trip started
trips_2021$start_hour <- hour(trips_2021$started_at) 
    # We will add the abbreviation of each month's name for cleaner plots
trips_2021$months_abbre <- month.abb[trips_2021$start_month] # ex. 2-> Feb
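
The month abbreviations can also be stored as a factor whose levels follow calendar order, so that tables and plots sort from Jan to Dec without a separate reordering step. This is a minimal sketch with made-up month numbers; `month_numbers` and `month_labels` are illustrative names.

```r
# Made-up month numbers, in the form returned by month()
month_numbers <- c(12, 1, 7, 7)

# month.abb is a built-in vector: "Jan", "Feb", ..., "Dec"
# Using it for the levels fixes the order to the calendar order
month_labels <- factor(month.abb[month_numbers], levels = month.abb)

# Plain character sorting is alphabetical
sort(unique(as.character(month_labels)))

# Factor levels keep the calendar order
levels(droplevels(month_labels))
```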

In [None]:
# We will set the order of the weekdays to start on Monday
# In order to be in the same week format as used in Chicago
trips_2021$start_weekday <- ordered(trips_2021$start_weekday, 
                                      levels=c("Monday", "Tuesday", "Wednesday",
                                               "Thursday", "Friday",
                                               "Saturday", "Sunday"))

# **Analyzing our Data**

### **First, we will take a look at how the number of rides differs by month across the year**
* #### We will create a summary table using summarize(), grouping our results by the month, the month's abbreviation, and the membership type.
* #### Then we will order the summary by the numeric representation of the month
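
The grouping and counting step can be tried on a handful of made-up rows first. The sketch below assumes dplyr 1.0 or later for the `.groups` argument; the `toy_trips` values are invented.

```r
library(dplyr)

# Five made-up trips across two months
toy_trips <- tibble(
  start_month   = c(1, 1, 1, 2, 2),
  member_casual = c("member", "casual", "member", "casual", "casual")
)

# One row per month and rider type, with the number of trips in each group
monthly_counts <- toy_trips %>%
  group_by(start_month, member_casual) %>%
  summarize(total_monthly_rides = n(), .groups = "drop") %>%
  arrange(start_month)

monthly_counts
```

Setting `.groups = "drop"` returns an ungrouped table and silences the message that `summarize()` otherwise prints about the remaining grouping.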

In [None]:
trips_monthly <- trips_2021 %>%
  group_by(start_month, member_casual, months_abbre) %>% 
  summarize(total_monthly_rides=n()) %>% 
  arrange(start_month)

#### **To produce the following plot we use ggplot2 as follows**
**![Monthly rides distribution across the year](https://drive.google.com/file/d/1Wb96eTCsMYQsF8QUnW2o-xSwcLC5Xo3z/view)**

In [None]:
# Inside geom_col() we use position="dodge" to place the columns side by side instead of the default stacked columns
# theme() can customize many aspects of the plot; we use it here to move the legend to the top
# scale_y_continuous() sets which breaks are shown on the y-axis and how they are labeled
trips_monthly %>% 
  ggplot(aes(x=reorder(months_abbre, start_month), y=total_monthly_rides, fill=member_casual)) +
  geom_col(position = "dodge") + 
  labs(title="Total number of Rides per Month", caption="Data provided by: Motivate International Inc",
       x="", y="Number of rides") + 
  guides(fill=guide_legend(title="")) + 
  theme(legend.position="top", legend.key.size = unit(0.3, 'cm')) +
  scale_y_continuous(breaks = c(75000, 150000, 225000, 300000, 375000), labels=c("75K", "150K", "225K", "300K", "375K"))
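
The hand-written axis labels can also be generated by a small function, which avoids keeping two vectors in sync. This is a sketch only; `label_thousands` and `ride_breaks` are hypothetical names introduced for illustration.

```r
# Hypothetical helper: turn 75000 into "75K"
label_thousands <- function(x) paste0(x / 1000, "K")

# The same break positions as in the monthly plot
ride_breaks <- seq(75000, 375000, by = 75000)

label_thousands(ride_breaks)
```

Since ggplot2 accepts a function for the `labels` argument, `scale_y_continuous(breaks = ride_breaks, labels = label_thousands)` should produce the same axis as the two explicit vectors.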

### **Secondly, we will view how the number of rides differs by the day of the week across our data set**
* #### We will create a summary table using summarize(), grouping our results by the weekday on which the trip started and the membership type.
* #### We already set the order of our weekdays to start on Monday, so no further ordering is needed here
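
A Monday-first ordering can also be requested directly from lubridate through the `week_start` argument of `wday()`. The sketch below uses a single made-up timestamp; the weekday names it returns depend on the session's locale.

```r
library(lubridate)

# A made-up timestamp; 2021-01-23 fell on a Saturday
x <- ymd_hms("2021-01-23 16:14:19")

# With week_start = 1, Monday is day 1, so Saturday is day 6
wday(x, week_start = 1)

# With label = TRUE the result is an ordered factor whose levels begin on Monday
weekday <- wday(x, label = TRUE, abbr = FALSE, week_start = 1)
levels(weekday)
```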

In [None]:
trips_weekly <- trips_2021 %>% group_by(start_weekday, member_casual) %>% 
  summarize(total_weekly_rides=n())

#### **To produce the following plot we use ggplot2 as follows**
**![Rides during each weekday](https://drive.google.com/file/d/1TAbUly6xnZRYthc_2MaRkO_wsYefuiyt/view?usp=sharing)**

In [None]:
trips_weekly %>% ggplot(aes(x=start_weekday, y=total_weekly_rides, fill = member_casual)) +
  geom_col(position = "dodge") + 
  labs(title="Total Weekly Rides per Weekday",
       caption="Data provided by: Motivate International Inc", x="", y="Number of rides") +
  guides(fill=guide_legend(title="")) +
  theme(legend.position="top") +
  scale_y_continuous(breaks = c(75000, 150000, 225000, 300000, 375000, 450000),
                     labels=c("75K", "150K", "225K", "300K", "375K", "450K"))

### **Now, to take a look at the rides started during each hour of the day**
* #### We will create a summary table using summarize(), grouping our results by the hour at which the trip started and the membership type.
* #### Our data uses the 24-hour format, so no ordering is required here

In [None]:
trips_hourly <- trips_2021 %>% group_by(start_hour, member_casual) %>% 
  summarize(trips_at_hour=n())

#### **To produce the plot we use ggplot2 as follows**

In [None]:
trips_hourly %>% ggplot(aes(x=start_hour, y=trips_at_hour, color=member_casual)) +
  geom_line(size=2, alpha=0.9, linetype=2) + 
  scale_y_continuous(breaks = c(75000, 150000, 225000, 300000, 375000, 450000),
                     labels=c("75K", "150K", "225K", "300K", "375K", "450K"))

### **Finally, we take a look at the average ride duration during each day of the week**
* #### We will create a summary table using summarize(), grouping our results by the day of the week on which the trip started and the membership type.
* #### We already set the order of our week to start on Monday, so no ordering is required here
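
Once average durations are available per rider type, the gap between the two groups can be expressed as a percentage. The sketch below shows the arithmetic on four made-up trips; the numbers are invented and are not results from the Divvy data.

```r
# Made-up trip durations in seconds for each rider type
toy_trips <- data.frame(
  member_casual      = c("casual", "casual", "member", "member"),
  trip_duration_secs = c(1800, 1200, 600, 1200)
)

# Mean duration in minutes per rider type
avg_mins <- tapply(toy_trips$trip_duration_secs / 60, toy_trips$member_casual, mean)

# How much longer the average casual trip is than the average member trip, in percent
pct_longer <- (avg_mins[["casual"]] / avg_mins[["member"]] - 1) * 100

avg_mins
pct_longer
```

The mean is sensitive to a few very long trips, so comparing it with `median()` on the real data may be worthwhile before quoting a single figure.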

In [None]:
trips_average <- trips_2021 %>% group_by(start_weekday, member_casual) %>%
  summarize(average_trip_mins=mean(trip_duration_secs/60))

In [None]:
trips_average %>% ggplot(aes(x=start_weekday, y=average_trip_mins, fill = member_casual)) +
  geom_col(position = "dodge") + 
  labs(title="Average time for a single trip during each weekday",
       caption="Data provided by: Motivate International Inc", x="", y="Duration (Minutes)") +
  guides(fill=guide_legend(title="")) +
  theme(legend.position="top") +
  scale_y_continuous(breaks = c(5, 10, 15, 20, 25, 30, 35),
                     labels=c("5", "10", "15", "20", "25", "30", "35"))

# **Key Takeaways**
1. #### The number of trips from both user groups increases during the Warm Season in Chicago (June 3rd to September 20th). 
1. #### Casual Riders’ trips peak during weekends, when they surpass the number of Members’ trips. 
1. #### The peak number of trips from both Members and Casual Riders occurs during Chicago rush hour (Chicago rush hour historically occurs between **7 a.m. and 9 a.m.** and between **3 p.m. and 7 p.m.**). 
1. #### Casual Riders spend on average 50% more time per trip than Members. 

# **Recommendations**
1. #### Use the Chicago Warm Season as the focus period of the marketing strategy, since this period is the most suited for bike rides and has the highest trip count. 
1. #### Introduce a new pricing plan: a Weekend Pass would be a good incentive for Casual Riders to buy a membership. The average trip time for Casual Riders during weekends is over 35 minutes, so a tier of 45 minutes is recommended. 
1. #### Use push notifications during rush hours that suggest taking a bike today, and highlight or offer competitive rates during these periods. 
1. #### Highlight the benefits of the different passes to riders with longer trip durations; for example, show the price difference Casual Riders would see if they were Members.

# **Considerations**
### **More data points may be needed to further extend the scope of the analysis, such as:**
* #### **Information about the service users** (Gender, Age, Physical health; This can enhance the marketing strategy and improve understanding of the targeted demographic group) 
* #### **Information about the premium plans** (Prices, Tiers, Student incentives; These data points can help us better understand why users gravitate towards one plan over another, and the potential for growth)   

### **Thank you for your time and patience; any feedback is appreciated. A link to a PowerPoint presentation that tells the story of our data and findings will be added here.**