---
title: "Cyclistic Bike Share"
author: "Deepak"
date: '2022-07-20'
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction

**Cyclistic** is a Chicago-based bike-share company operating 5,824 bicycles across 692 strategically located stations in the city. The company offers riders the convenience of picking up a bicycle at one station and returning it to any other station, providing a flexible and eco-friendly mode of transportation. Cyclistic serves two primary customer groups: casual riders, who purchase single-ride or full-day passes, and Cyclistic members, who commit to annual memberships.

The primary focus of this report is to address Cyclistic's key business challenge: converting casual riders into loyal annual members. This conversion is crucial for the company's sustained growth, aimed at cultivating customer loyalty, increasing membership revenue, and establishing Cyclistic as a leader in the competitive bike-sharing industry.



## Business Objectives

Cyclistic has set out the following objectives for this data-driven analysis:

1. **Understand Casual Rider Behavior:** Gain a comprehensive understanding of how casual riders utilize Cyclistic's services and how their behavior differs from that of annual members.

2. **Identify Conversion Opportunities:** Identify trends, patterns, and opportunities within Cyclistic's historical bike trip data that can be leveraged to encourage casual riders to transition into annual memberships.



# Data Preparation

To conduct a meaningful analysis, it's essential to understand the dataset's source and characteristics. The dataset used for this analysis is publicly accessible through the [Divvy Trips Data website](https://divvy-tripdata.s3.amazonaws.com/index.html) and is provided by Motivate International Inc. under their [data license agreement](https://ride.divvybikes.com/data-license-agreement).

**Dataset Overview:**

- This dataset contains ride details stored in monthly CSV files. Additionally, it includes station data, detailing station IDs, names, locations, and capacities.
- Trip data files provide essential information such as unique ride IDs, trip duration, start and end stations, and bike types.
- The analysis will focus solely on Cyclistic's user data, making it a credible and pertinent source for insights.

**Data Quality:**

- The dataset maintains high data quality with unique ride IDs, ensuring data reliability and integrity.
- It adheres to the ROCCC criteria: Reliable, Original, Comprehensive, Current, and Cited. This reflects its trustworthiness, comprehensiveness, and up-to-date nature.

**Data Licensing and Privacy:**

- The dataset is licensed by Motivate International Inc. and does not contain personal rider information, complying with data privacy regulations.

**Data Structure:**

- Each CSV file follows a consistent structure with appropriate columns and data types.
- Station data is available for 2013 and 2014, while trip data spans from 2013 to 2022, allowing for a comprehensive temporal analysis.

This understanding of the dataset's source, quality, and structure will guide our analysis, ensuring meaningful insights and data-driven recommendations.



## Processing the Data

In the data preparation phase, we transformed the raw dataset to make it suitable for analysis. Initially, Google Sheets was used, but due to the large file sizes, we transitioned to **RStudio Desktop** for more efficient data processing. Our analysis will focus on data from the most recent year, spanning from *July 2021 to June 2022*.

Here are the key steps in data processing:

- **Duplicates Removal**: We began by checking for and eliminating duplicate records within the dataset. The `distinct()` function from the `dplyr` package was employed, ensuring that only unique rows were retained.

- **Date and Time Conversion**: To facilitate calculations, we converted the start and end time columns from string format to POSIXct (date-time) format using the `as.POSIXct()` function.

- **Derived Columns**: New columns were created to enhance our analysis. These include:
  - `ride_length`: Calculated as the time difference between ride end and start times.
  - `day_of_week`: Extracted from the start time to identify the day of the week in abbreviated form.
  - `start_hour`: Extracted from the start time to identify the hour at which the ride began.
  - `year` and `month`: Extracted to categorize rides by year and month using labels.

- **Handling Missing Values**: We checked for missing or null values in the dataset to ensure data completeness and reliability.

- **Filtering Data**: We applied filtering to the `ride_length` column to remove rides with durations less than one minute. This also effectively removed negative values, ensuring that only meaningful ride data remained.

- **Exporting Data**: Finally, the processed data was exported to Excel format using the `writexl` package, and the updated CSV files were saved.

Data processing steps:

```{r code-for-cleaning, echo=TRUE, message=FALSE, warning=FALSE, results='hide'}
library(dplyr)
library(ggplot2)
library(tidyverse)
library(lubridate)
library(glue)

files <- list("202107-divvy-tripdata","202108-divvy-tripdata",
              "202109-divvy-tripdata","202110-divvy-tripdata","202111-divvy-tripdata",
              "202112-divvy-tripdata","202201-divvy-tripdata","202202-divvy-tripdata",
              "202203-divvy-tripdata","202204-divvy-tripdata","202205-divvy-tripdata",
              "202206-divvy-tripdata")
clean_files <- list()
i <- 1
for (f in files){
  
  df <- read.csv(glue("../input/cyclistic-bike-share/{f}.csv"))
  
  # Removing duplicate rows; the result must be assigned back to df to take effect.
  df <- df %>% distinct()
  
  #Converting start and end time from string datatype to DateTime.
  df$started_at <- as.POSIXct( df$started_at, tz = "" ) 
  df$ended_at <- as.POSIXct( df$ended_at, tz = "" )
  
  # Adding new columns for ride length (in seconds), day of week, start hour, year and month.
  # difftime() with explicit units avoids the automatic unit choice made by `ended_at - started_at`.
  df_new <- df %>%
    mutate(ride_length = as.numeric(difftime(ended_at, started_at, units = "secs"))) %>%
    mutate(day_of_week = weekdays(started_at, abbreviate = TRUE)) %>%
    mutate(start_hour = hour(started_at)) %>%
    mutate(year = year(started_at)) %>%
    mutate(month = month(started_at, label = TRUE))

  # Keeping only rides of at least 60 seconds; this also drops negative ride lengths.
  filtered_df <- df_new %>%
    filter(ride_length >= 60)

    
  # Storing the cleaned dataframe for this month as the next element of clean_files.
  clean_files[[i]] <- filtered_df
  i <- i+1
  
}

# final file which has clean data 
all_trips <- bind_rows(clean_files)
```
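
The missing-value check listed among the processing steps is not shown in the chunk above. A minimal sketch of one way to run it follows. It assumes that `read.csv()` leaves blank text fields as empty strings rather than `NA`, so both are counted. The toy data frame and its `start_station_name` and `end_lat` columns are illustrative assumptions, not taken from the processed data.

```r
# Toy stand-in for one month of trip data (values are made up for illustration).
trips <- data.frame(
  ride_id            = c("A1", "B2", "C3", "D4"),
  start_station_name = c("Clark St", "", NA, "State St"),
  end_lat            = c(41.90, NA, 41.88, 41.89),
  stringsAsFactors   = FALSE
)

# Count NA values per column.
na_counts <- colSums(is.na(trips))

# Count empty strings per column; na.rm = TRUE skips cells that are already NA.
blank_counts <- colSums(trips == "", na.rm = TRUE)

# Combine both counts into one summary table.
missing_summary <- data.frame(
  column  = names(trips),
  n_na    = as.integer(na_counts),
  n_blank = as.integer(blank_counts)
)
print(missing_summary)
```

Whether rows with missing station names or coordinates should be dropped or kept depends on the question being asked; the sketch only reports them.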

The structure of the final dataframe is given below.

```{r clean-data-output, echo=FALSE}
str(all_trips)
```

With the dataset now prepared and cleaned, we are poised to proceed with our analysis and derive valuable insights.

## Exploratory Data Analysis


We now proceed to the exploratory data analysis (EDA) phase to uncover valuable insights.


First 6 rows of the dataframe:
```{r echo=FALSE, message=FALSE, warning=FALSE}
head(all_trips) 
```

Dimensions of the dataframe (rows x columns):
```{r echo=FALSE, message=FALSE, warning=FALSE}
dim(all_trips) 
```


The mean and standard deviation of the `ride_length` column, which summarize ride duration, are shown below:


```{r echo=FALSE, message=FALSE, warning=FALSE}
s <- sd(all_trips$ride_length)
m <- mean(all_trips$ride_length) 
```

| Statistic           | Value   |
|---------------------|---------|
| Standard Deviation  | `r s`   |
| Mean                | `r m`   |



In our analysis of the `ride_length` column, we observed that the mean value is approximately 1237 seconds (about 21 minutes), while the maximum value is an extreme 2,946,429 seconds (approximately 34 days). This extreme value is an outlier in the dataset. To quantify the spread, we calculated the standard deviation of the `ride_length` column, which is approximately 9391.46 seconds.

To effectively identify and remove outliers, we employed the Z-score method. The Z-score measures the number of standard deviations a given value deviates from the mean and is computed using the formula:

\[Z = \frac{{X - \mu}}{{\sigma}}\]

Here, \(Z\) represents the Z-score, \(X\) is the individual value, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.

To remove outliers from the dataset, we introduced a new column called `z_score` holding the absolute Z-score of each row. We then filtered the data to retain only those rows with an absolute Z-score below 3, which is a common threshold for identifying outliers.
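
To make the threshold concrete, the sketch below plugs the approximate figures reported above (mean 1237 s, standard deviation 9391.46 s, maximum 2,946,429 s) into the formula. The variable names are illustrative, and the exact values depend on the data actually loaded.

```r
# Approximate summary figures quoted in the text, in seconds.
m_reported <- 1237
s_reported <- 9391.46
x_max      <- 2946429

# Z-score of the longest ride.
z_max <- (x_max - m_reported) / s_reported

# Ride length at which the Z-score reaches 3, in seconds and in hours.
cutoff_secs  <- m_reported + 3 * s_reported
cutoff_hours <- cutoff_secs / 3600

round(c(z_max = z_max, cutoff_secs = cutoff_secs, cutoff_hours = cutoff_hours), 2)
```

Under these figures the longest ride sits more than 300 standard deviations above the mean, and the filter keeps rides shorter than roughly eight hours. Because the mean minus three standard deviations is negative, the filter removes nothing at the short end.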

The summary statistics for the `ride_length` column after removing outliers are as follows:

```{r echo=TRUE, message=FALSE, warning=FALSE}
# Adding a new column for z-scores to identify outliers
all_trips_1 <- all_trips %>%
  mutate(z_score = (abs(ride_length - m) / s))

# Removing outliers
all_trips_new <- subset(all_trips_1, z_score < 3)

s_new <- sd(all_trips_new$ride_length)
m_new <- mean(all_trips_new$ride_length) 
```

| Statistic           | Value   |
|---------------------|---------|
| Standard Deviation  | `r s_new`   |
| Mean                | `r m_new`   |



To gain insights into Cyclistic's customer base, we analyze the composition of members and casual riders. This analysis helps us understand the proportion of annual members compared to casual riders in the dataset. We calculate the percentage of each group within the cleaned dataset.

```{r echo=TRUE, message=FALSE, warning=FALSE}
#calculating percentage of members
all_trips_new %>% 
  group_by(member_casual) %>% 
  summarise(count = length(ride_id),
            '%' = (length(ride_id) / nrow(all_trips_new)) * 100)

```

# Results

In this section, we present the key findings from our analysis of Cyclistic's bike-share data. We utilized both Tableau Desktop and R to visualize and interpret the data. You can explore the interactive dashboards created with Tableau in this [**Tableau Story**](https://public.tableau.com/views/CyclisticProject_16584852469400/Story1?:language=en-US&:display_count=n&:origin=viz_share_link).

## Membership Composition

Our analysis revealed the distribution between Cyclistic's annual members and casual riders in the past year. The table and graph below summarize our findings:

| Membership Type | Number of Rides | Percentage |
|-----------------|-----------------|------------|
| Casual          | 2,511,661       | 43.4%      |
| Member          | 3,280,339       | 56.6%      |

![Membership Composition](https://i.postimg.cc/9X4VGtJL/percentage.png)

These statistics illustrate that, over the past year, 43.4% of rides were taken by casual riders, while 56.6% were by Cyclistic members. This indicates that a substantial portion of rides, approximately 2.5 million, were taken by non-members. Converting these casual riders into annual members represents a significant growth opportunity for Cyclistic.

## Ride Distribution by Weekday

The graph below showcases the distribution of rides on weekdays and weekends:

![Weekday Ride Distribution](https://i.postimg.cc/Dfth5JNn/weekends.png)

- Sundays experience the highest ridership, followed closely by Saturdays.
- Casual riders tend to ride more on weekends, while members predominantly use the service on weekdays. This pattern suggests that casual riders use bikes for leisure and weekend activities, presenting an opportunity for targeted promotions during these periods.
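
The charts in this section were built in Tableau. As a rough R equivalent, the counts behind a weekday chart could be derived as sketched below. The toy data frame stands in for `all_trips_new`, and only the `member_casual` and `day_of_week` columns created in the cleaning step are assumed.

```r
library(dplyr)

# Toy stand-in for all_trips_new (values are made up for illustration).
toy_trips <- data.frame(
  member_casual = c("casual", "casual", "member", "member", "member", "casual"),
  day_of_week   = c("Sat", "Sun", "Mon", "Tue", "Sat", "Sat")
)

# Number of rides per rider type and weekday, busiest day first within each type.
rides_by_day <- toy_trips %>%
  count(member_casual, day_of_week, name = "rides") %>%
  arrange(member_casual, desc(rides))

print(rides_by_day)
```

The same pattern, with `start_hour` in place of `day_of_week`, would give the hourly counts.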

## Rides Per Hour Throughout the Week

The following chart illustrates the distribution of rides across different hours of the day throughout the week:

![Hourly Ride Distribution](https://i.postimg.cc/kDrT8dfP/rides-weekly.png)

- On weekdays, the peak hours are at 5 PM and 6 PM, while mornings show lower ridership.
- Weekends see the most rides between 10 AM and 6 PM, with the peak occurring between 2 PM and 4 PM.
- Notably, casual riders exhibit higher ridership during weekends, with 3 PM being the busiest time.

## Bike Type Preference

This distribution chart displays the type of bikes used and their average ride length in seconds:

![Bike Type Preference](https://i.postimg.cc/50R74zz0/type-bike.png)

- Casual riders favor docked bikes, followed by classic and electric bikes.
- Members mainly use classic and electric bikes, with similar ride lengths.
- On average, casual riders' docked bike rides last approximately 2,650 seconds (44 minutes).
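
A sketch of how the averages behind such a chart might be computed in R is given below. The column name `rideable_type` and its values are assumptions about the raw files, and the toy ride lengths are made up, chosen so that the docked-bike mean for casual riders matches the 2,650 seconds quoted above.

```r
library(dplyr)

# Toy stand-in for all_trips_new; rideable_type is an assumed column name.
toy_trips <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member"),
  rideable_type = c("docked_bike", "docked_bike", "classic_bike",
                    "classic_bike", "electric_bike"),
  ride_length   = c(2400, 2900, 1500, 700, 800)  # seconds
)

# Ride count and mean ride length per rider type and bike type.
avg_by_bike <- toy_trips %>%
  group_by(member_casual, rideable_type) %>%
  summarise(
    rides          = n(),
    mean_ride_secs = mean(ride_length),
    mean_ride_mins = mean_ride_secs / 60,
    .groups        = "drop"
  )

print(avg_by_bike)
```

Reporting the mean in minutes alongside seconds makes the conversion explicit: 2,650 seconds is about 44.2 minutes.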

## Station Traffic Data

For both casual and member riders, weekends are the busiest. The map below presents weekend traffic data across Chicago:

![Station Traffic Data](https://i.postimg.cc/cJ4G6QvM/traffic.png)

- Stations in Central and North Chicago witness the highest rider activity among both casual and member riders.
- Identifying and targeting stations with high ridership could be an effective strategy to attract and retain casual riders.

## Annual Ride Length Trends

This chart provides insights into the average ride length throughout the year:

![Annual Ride Length Trends](https://i.postimg.cc/pr50pdQB/yearly.png)

- Casual riders consistently record longer rides, with July and August marking the peak.
- Higher ride lengths during these months may be attributed to the summer season and school holidays.

## Hourly Ride Trends

The chart below presents the number of rides by the hour:

![Hourly Ride Trends](https://i.postimg.cc/yYcHG6q9/rides-hourly.png)

- 5 PM emerges as the peak time for rides, with casual riders accounting for nearly 600,000 rides, almost double the number of member rides during this period.

These findings provide valuable insights for Cyclistic, highlighting areas of opportunity and suggesting strategies for converting casual riders into loyal annual members.


# Conclusion

In conclusion, our analysis of Cyclistic's bike-share data has revealed significant differences in usage patterns between casual riders and members:

- Casual riders tend to prefer docked bikes, whereas members are more diversified in their choice of bike types.

- Weekend rides are more common among casual riders, while members predominantly use the service on weekdays. Overall, members account for the larger share of rides (56.6%, versus 43.4% for casual riders).

- Morning hours see a higher concentration of members, possibly indicating their reliance on Cyclistic for daily commuting.

- On average, casual riders have significantly longer ride durations compared to members.

- July emerges as the peak month for bike rides, regardless of rider type, possibly attributed to summer weather and school holidays.

- The busiest time of day for both rider groups is 5 PM.

To leverage these insights for the benefit of Cyclistic and increase annual memberships, the following recommendations are proposed:

- Implement pricing strategies that target casual riders during weekends, such as promotional discounts for weekend rides, coupled with reduced membership prices.

- Advertise membership benefits during peak riding hours to encourage casual riders to subscribe.

- Explore the introduction of monthly membership plans as an attractive option for converting casual riders into regular members.

It's worth noting that obtaining data on current members and non-members, beyond ride IDs, could provide deeper insights into the study. This case study has enhanced proficiency in data analytics, using tools like R, Tableau, spreadsheets, and SQL, and has honed the ability to address critical business questions in a structured manner. The key steps of data analytics - Ask, Prepare, Process, Analyze, Share, and Act - have been successfully put into practice.