# **INTRODUCTION**

Welcome to the Cyclistic bike-share data analysis!
In this analysis we work for a fictional company, Cyclistic, and answer its key business questions by following the six steps of data analysis: **ask, prepare, process, analyze, share, and act.**

**SCENARIO**

You are a junior data analyst working in the marketing analytics team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand **how casual riders and annual members use Cyclistic bikes differently.**

From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.

**CHARACTERS AND TEAMS**

● Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself
apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with
disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.

● Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns
and initiatives to promote the bike-share program. These may include email, social media, and other channels.

● Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and
reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy
learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic
achieve them.

● Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the
recommended marketing program.

**ABOUT THE COMPANY**

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.
One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the
pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: **Design marketing strategies aimed at converting casual riders into annual members**. In order to do that, however, the marketing analytics team needs to better understand *how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics.* Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.

# **ASK**

**Guiding questions**

● What is the problem you are trying to solve?
We are trying to understand how annual members differ from casual riders.

● How can your insights drive business decisions?
My insights will help the marketing team devise strategies for converting casual riders into annual members.

**Business task**: How do casual riders and annual members differ?

**Stakeholders**: Lily Moreno and the Cyclistic marketing analytics team.

# **PREPARE**

**Guiding questions**

● Where is your data located? The data is hosted as a Kaggle dataset.

● How is the data organized? The data is organized monthly for each year.

● Are there issues with bias or credibility in this data? Does your data ROCCC? No obvious bias was found in the data, and it is credible. The data meets the ROCCC criteria because it is reliable, original, comprehensive, current, and cited.

● How are you addressing licensing, privacy, security, and accessibility? The data is licensed, and the personal information of riders is not shared.

● How did you verify the data’s integrity? The data is consistent and has the required information.

● How does it help you answer your question? The data contains historical bike trip info which will be useful for our analysis. 

● Are there any problems with the data? The data needs some cleaning and manipulation.


**Key tasks**

1. Download data and store it appropriately. 
2. Identify how it’s organized.
3. Sort and filter the data. 
4. Determine the credibility of the data.

**Loading the required libraries**

In [1]:
library(tidyverse)
# the %>% pipe is already attached by tidyverse; tidymodels is optional for this analysis
library(tidymodels)

**Loading the Divvy trips data for the year 2021 and combining it into one data frame**

In [2]:
csv_files <- list.files(path = "../input/divvytrips2021", recursive = TRUE, full.names=TRUE)
cyclistic_df <- do.call(rbind, lapply(csv_files, read.csv))
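
Because `rbind()` expects every monthly file to have the same columns, it can help to confirm that before combining. The following is a minimal base R sketch, not the notebook's actual loading step: the folder `divvy_sketch` and the two tiny CSV files are made-up stand-ins for the real monthly files.

```r
# Hypothetical stand-ins for two monthly files (the real data lives in ../input/divvytrips2021)
tmp_dir <- file.path(tempdir(), "divvy_sketch")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(tmp_dir, "2021-01.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "b1", member_casual = "casual"),
          file.path(tmp_dir, "2021-02.csv"), row.names = FALSE)

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)
monthly <- lapply(files, read.csv)

# Compare the column names of every file against the first one
same_columns <- all(vapply(monthly,
                           function(d) identical(names(d), names(monthly[[1]])),
                           logical(1)))
stopifnot(same_columns)

# Only combine once the layouts are known to match
combined <- do.call(rbind, monthly)
nrow(combined)
```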

In [3]:
head(cyclistic_df)

# **PROCESS**

 **CLEANING THE DATA**

1. Removing Duplicates

In [4]:
cyclistic_df_no_dups <- cyclistic_df[!duplicated(cyclistic_df$ride_id), ]
print(paste("No. of duplicate entries removed: ", nrow(cyclistic_df) - nrow(cyclistic_df_no_dups)))
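
To see what `duplicated()` does in the step above, here is a small sketch on a made-up data frame (the `rides` table and its values are illustrative only). It keeps the first occurrence of each `ride_id` and drops later repeats.

```r
# Made-up rides; "r2" appears twice
rides <- data.frame(
  ride_id = c("r1", "r2", "r2", "r3"),
  member_casual = c("member", "casual", "casual", "member")
)

# duplicated() flags the second and later occurrences, so the first copy is kept
dup_flag <- duplicated(rides$ride_id)
rides_unique <- rides[!dup_flag, ]
removed <- sum(dup_flag)
removed
```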

2. Correcting the data types

In [5]:
cyclistic_df_no_dups$started_at <- lubridate::ymd_hms(cyclistic_df_no_dups$started_at)
cyclistic_df_no_dups$ended_at <- lubridate::ymd_hms(cyclistic_df_no_dups$ended_at)

3. Manipulating data for analysis

In [6]:
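# add a column with the length of each ride in minutes;
# difftime() with explicit units avoids relying on the units R picks automatically for `-`
cyclistic_df_no_dups$ride_length <- as.numeric(difftime(cyclistic_df_no_dups$ended_at, cyclistic_df_no_dups$started_at, units = "mins"))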
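
Subtracting two date-time vectors returns a `difftime` whose units R chooses from the data, so a ride-length column is safer when the units are fixed explicitly. A minimal sketch with two made-up rides:

```r
# Made-up rides of 10 and 30 minutes
started <- as.POSIXct(c("2021-06-01 10:00:00", "2021-06-01 12:00:00"), tz = "UTC")
ended <- started + c(600, 1800)  # adding a number to POSIXct adds seconds

# Plain subtraction: the units depend on the smallest difference in the vector
auto_diff <- ended - started
units(auto_diff)

# Explicit units give minutes no matter what the data looks like
ride_minutes <- as.numeric(difftime(ended, started, units = "mins"))
ride_minutes
```

Here the automatic units happen to be minutes already, so dividing the plain difference by 60 would understate every ride by a factor of 60.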

In [7]:
# day of the week for each ride start (with lubridate's defaults, 1 = Sunday ... 7 = Saturday)
cyclistic_df_no_dups$weekday <- lubridate::wday(cyclistic_df_no_dups$started_at)

In [8]:
# month-year label (e.g. "07-2021") and hour of day (e.g. "17") for each ride start
cyclistic_df_no_dups$month_year <- format(as.Date(cyclistic_df_no_dups$started_at), "%m-%Y")
cyclistic_df_no_dups$hour <- format(cyclistic_df_no_dups$started_at, "%H")
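
The derived columns can be checked on a single made-up timestamp. This sketch uses base R only; `as.POSIXlt(x)$wday + 1` reproduces the numbering that `lubridate::wday()` returns with its default arguments (1 = Sunday).

```r
# Made-up ride start; 4 July 2021 was a Sunday
ts <- as.POSIXct("2021-07-04 17:45:00", tz = "UTC")

month_year <- format(ts, "%m-%Y")           # month and year as text
hour <- format(ts, "%H")                    # two-digit hour of day as text
weekday_number <- as.POSIXlt(ts)$wday + 1   # base R counts from 0 = Sunday, so add 1

c(month_year, hour, weekday_number)
```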

4. Saving our results into a csv file

In [None]:
# row.names = FALSE keeps R's row index out of the file
write.csv(cyclistic_df_no_dups, "cyclistic_clean.csv", row.names = FALSE)

**Guiding questions**

● What tools are you choosing and why? I am using R for my analysis because of the size of the dataset and to get familiar with the language.

● Have you ensured your data’s integrity? Yes

● What steps have you taken to ensure that your data is clean? Removing duplicates, correcting data types, and deriving the columns needed for analysis.

● How can you verify that your data is clean and ready to analyze? By checking the data types and looking for NA values or other missing information.

● Have you documented your cleaning process so you can review and share those results? The cleaning steps are recorded in the code cells above, and the cleaned data is saved to a CSV file.
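
The checks mentioned above (data types and missing values) can be sketched as follows. The `rides` data frame is made up for illustration; on the real data the same calls would be applied to the combined trip table.

```r
# Made-up rides with one missing timestamp and one missing ride length
rides <- data.frame(
  ride_id = c("r1", "r2", "r3"),
  started_at = as.POSIXct(c("2021-03-01 08:00:00", "2021-03-01 09:00:00", NA), tz = "UTC"),
  ride_length = c(12.5, NA, 7),
  stringsAsFactors = FALSE
)

# First class of every column, to confirm types are what the analysis expects
column_types <- vapply(rides, function(col) class(col)[1], character(1))

# Number of missing values per column, and number of rows with no missing values
na_counts <- colSums(is.na(rides))
complete_rows <- sum(complete.cases(rides))

column_types
na_counts
complete_rows
```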

**Key tasks**

1. Check the data for errors.
2. Choose your tools.
3. Transform the data so you can work with it effectively.
4. Document the cleaning process.

# **ANALYZE**

In [9]:
cyclistic <- cyclistic_df_no_dups
head(cyclistic)

**1. TOTAL MEMBERS**

What is the distribution of casual and annual members?

In [10]:
cyclistic %>%
    group_by(member_casual) %>%
    summarize(total_rides = n(), percentage = n() / nrow(cyclistic) * 100)

In [11]:
ggplot(data = cyclistic, aes(x = member_casual, fill = member_casual)) + geom_bar()

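Observation: Annual members account for more rides than casual riders, by roughly 10%. Each row in the data is one ride, so this compares ride counts rather than numbers of distinct riders.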

**2. WEEKDAY**

How are the rides distributed across different weekdays?

In [12]:
cyclistic %>%
    group_by(weekday) %>%
    summarize(total_rides = length(ride_id), 
              annual_members_percent = (sum(member_casual == "member")/length(ride_id)) * 100, 
              casual_members_percent = (sum(member_casual == "casual")/length(ride_id)) * 100)

In [13]:
ggplot(data = cyclistic, aes(x = weekday, fill = member_casual)) + geom_bar() 

Observation:
1. Weekends have the highest number of rides.
2. Casual riders ride more on weekends, which could imply that they use bikes for leisure activities.
3. Annual members ride more on weekdays, which could imply that they use bikes to commute to work.
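
Two different percentages are useful for this kind of weekday comparison: the share of each rider type within a day, and how each rider type spreads its rides across days. A minimal base R sketch on made-up rides (the `rides` table is illustrative only):

```r
# Made-up rides: three on Saturday, four on Monday
rides <- data.frame(
  weekday = c("Sat", "Sat", "Sat", "Mon", "Mon", "Mon", "Mon"),
  member_casual = c("casual", "casual", "member", "member", "member", "member", "casual")
)

counts <- table(rides$weekday, rides$member_casual)

# Rows sum to 100: who is riding on a given day?
share_within_day <- prop.table(counts, margin = 1) * 100

# Columns sum to 100: on which days does each rider type ride?
spread_across_days <- prop.table(counts, margin = 2) * 100

share_within_day
spread_across_days
```

The second table is the one that speaks most directly to statements such as "casual riders ride more on weekends", because it is not affected by how many rides the other group takes.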

**3. RIDE LENGTH**

What is the general trend of ride length for annual and casual members?

In [14]:
cyclistic %>%
    group_by(member_casual) %>%
    summarize(avg_ride_length = mean(ride_length), min_ride_length = min(ride_length), max_ride_length = max(ride_length))

The summary shows negative values in ride_length.
Since ride_length is calculated as ended_at - started_at, some rows have an ended_at earlier than started_at, which is probably an error introduced when the data was recorded.
These rows should be rechecked with our stakeholders and corrected at the source.
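
Before raising the issue with stakeholders, the affected rows can be isolated so that there is something concrete to show. A minimal sketch on made-up rides (the `rides` table and its timestamps are illustrative only):

```r
# Made-up rides; r1 ends before it starts
rides <- data.frame(
  ride_id = c("r1", "r2", "r3"),
  started_at = as.POSIXct(c("2021-11-07 01:50:00", "2021-11-07 09:00:00", "2021-11-07 10:00:00"), tz = "UTC"),
  ended_at = as.POSIXct(c("2021-11-07 01:10:00", "2021-11-07 09:20:00", "2021-11-07 10:05:00"), tz = "UTC")
)

# Rows to report: the end time is earlier than the start time
bad_rows <- rides[rides$ended_at < rides$started_at, ]

# Rows that are safe to analyze
valid_rides <- rides[rides$ended_at >= rides$started_at, ]

nrow(bad_rows)
```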

In [15]:
# ventiles are the quantiles in 5% steps; keep only rides strictly between the 5th and 95th percentiles
ventiles <- quantile(cyclistic$ride_length, seq(0, 1, by = 0.05))
cyclistic_without_outliners <- cyclistic %>% 
    filter(ride_length > ventiles[["5%"]]) %>%
    filter(ride_length < ventiles[["95%"]])
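
The effect of this percentile trim can be seen on a small made-up vector. The sketch below uses base R and hypothetical ride lengths, one negative and one extremely long, to show which values survive.

```r
# Made-up ride lengths in minutes: one negative value and one extreme value
ride_length <- c(-5, 1:98, 10000)

# 5th and 95th percentiles (R's default quantile type)
cutoffs <- quantile(ride_length, probs = c(0.05, 0.95))

# Keep only values strictly between the two cutoffs
trimmed <- ride_length[ride_length > cutoffs[["5%"]] & ride_length < cutoffs[["95%"]]]

length(trimmed)
range(trimmed)
```

A trim like this always discards about 10% of the rows, including some legitimate short and long rides, so the plots that use the trimmed data describe typical rides rather than all rides.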

In [16]:
ggplot(data = cyclistic_without_outliners, aes(x = member_casual, y = ride_length, fill = member_casual)) + geom_boxplot()

Observation:
Casual riders ride for longer than annual members. This is consistent with casual riders using bikes for leisure activities and annual members following a fixed route to work.

**4. RIDE LENGTH AND WEEKDAYS**

How does this ride length vary across different weekdays?

In [17]:
# treat weekday as a discrete axis so that each day gets its own box
ggplot(cyclistic_without_outliners, aes(x = factor(weekday), y = ride_length, fill = member_casual)) + geom_boxplot() + 
    facet_wrap(~member_casual) + coord_flip()

**5. MONTHS**

What are the seasonal trends of bike rides?

In [18]:
cyclistic %>%
    group_by(month_year) %>%
    summarize(total_rides = length(ride_id), 
              casual_member_percent = ((sum(member_casual == "casual")/length(ride_id))*100),
              annual_member_percent = ((sum(member_casual == "member")/length(ride_id))*100))

In [19]:
ggplot(cyclistic, aes(x = month_year, fill = member_casual)) + geom_bar() + coord_flip()

Observation:
1. The months from May to October see the most rides, which implies that riding is least preferred during winter.
2. The trend is the same for both casual riders and annual members.
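
One detail about the month labels: text in `"%m-%Y"` form sorts by month first, which works for a single year such as 2021 but would misorder data spanning several years. A small sketch with made-up dates (the 2020 date is hypothetical, added only to show the effect):

```r
# Made-up ride dates spanning a year boundary
starts <- as.Date(c("2021-01-15", "2020-12-15", "2021-02-15"))

# Month-first labels sort by month number, ignoring the year
labels_month_first <- sort(format(starts, "%m-%Y"))

# Year-first labels sort chronologically
labels_year_first <- sort(format(starts, "%Y-%m"))

labels_month_first
labels_year_first
```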

**6. HOUR**

During which hours is bike riding most preferred?

In [20]:
ggplot(cyclistic, aes(x = hour, fill = member_casual)) + geom_bar() 

In [21]:
#hours and weekdays
ggplot(cyclistic, aes(x = hour, fill = member_casual)) + geom_bar() + facet_wrap(~weekday)

Observation:
1. Most rides take place in the afternoon.
2. This trend is the same for all weekdays.
3. Casual riders ride more on weekends, while annual members ride more on business days.

**7. RIDEABLE TYPE**

Does the type of bike also affect the distribution?

In [22]:
cyclistic %>%
    group_by(rideable_type) %>%
    summarize(
        casual_member_percent = (sum(member_casual == "casual")/length(ride_id)) * 100 ,
        annual_member_percent = (sum(member_casual == "member")/length(ride_id)) * 100 )
    

In [23]:
ggplot(cyclistic, aes(x = rideable_type, fill = member_casual)) + geom_bar() + coord_flip()

Observation:
1. Docked bikes are the least used (possibly because the company has fewer of them).
2. Classic bikes are the most used (they are also available in larger numbers).
3. No firm conclusion can be drawn from the bike type alone.

**Guiding questions**

● How should you organize your data to perform analysis on it? The data is combined into one file for easy analysis

● Has your data been properly formatted? Yes 

● What surprises did you discover in the data? That there are several unexpected factors that make casual riders different from annual members.

● What trends or relationships did you find in the data? 
1. Annual members take more rides than casual riders.
2. Bike rides are preferred more during weekends.
3. Casual riders ride for longer than annual members.
4. Bike riding is preferred during spring and summer.
5. Most rides take place around noon and in the evening across all weekdays.
6. The type of bike does not have any direct relation with the type of rider.

● How will these insights help answer your business questions? These insights will help me find the differences between casual riders and annual members, which in turn will help the business devise ways to convert casual riders into annual members.

**Key tasks**

1. Aggregate your data so it’s useful and accessible.
2. Organize and format your data.
3. Perform calculations.
4. Identify trends and relationships.

# **SHARE**

In [None]:
#The analysis is shared via presentation.

**Guiding questions**

● Were you able to answer the question of how annual members and casual riders use Cyclistic bikes differently? Yes

● What story does your data tell? That the two rider types use bikes for different reasons.

● How do your findings relate to your original question? I was able to find the differences between the two rider types, which was our goal.

● Who is your audience? What is the best way to communicate with them? My audience is my stakeholders. The findings from this analysis might seem complex to them, and they might not have the time to work through the details. So the best way is to present my insights as a presentation with simple, easy-to-understand derivations and conclusions.

● Can data visualization help you share your findings? Yes, it will make my analysis easy to comprehend.

● Is your presentation accessible to your audience? Yes

**Key tasks**

1. Determine the best way to share your findings.
2. Create effective data visualizations.
3. Present your findings.
4. Ensure your work is accessible.

# **ACT**

**Guiding questions**

● What is your final conclusion based on your analysis? The two rider types use bikes for different purposes, which is directly related to their membership type. Casual riders use bikes for longer durations, likely for leisure activities, whereas annual members take shorter rides, likely because they follow a fixed route when commuting to work.

● How could your team and business apply your insights? The insights can be applied to understand the differences between the types of customers being addressed and to devise effective strategies based on these findings to achieve the business goal.

● What next steps would you or your stakeholders take based on your findings? The next step will be to use the findings of this analysis to design marketing strategies aimed at converting casual riders into annual members.

● Is there additional data you could use to expand on your findings? No

**Key tasks**

1. Create your portfolio.
2. Add your case study.
3. Practice presenting your case study to a friend or family member.

**TOP 3 RECOMMENDATIONS**

1. Since casual riders use the bikes for leisure, marketing campaigns should also highlight the benefits of riding to work, promoting it as an environment-friendly habit.
2. The marketing team should highlight the health advantages of riding bikes daily, thereby addressing the business goal.
3. A strategy needs to be devised to prevent revenue loss during winter (attractive offers, etc.).