# Cyclistic Case Study
### Introduction:
This fictional case study is the *Final Capstone Project* in the *Google Data Analytics Certificate*. The data used is the public *Divvy* dataset. The topic is a bike share company in Chicago called Cyclistic. The bike share program groups its customers into two types of riders: <span style = "color:salmon;"> casual riders </span> who buy single-ride passes and/or full-day passes, and <span style = "color:cornflowerblue;"> annual members </span> who buy annual memberships. The company has concluded that annual memberships bring in more profit. With many customers already using the program, the company wants advertising strategies that turn casual riders into annual members.
### Scenario
The goal is to use the previous 12 months of trip data to analyze the differences between <span style = "color:salmon;"> casual riders </span> and <span style = "color:cornflowerblue;"> annual members </span>. Determine the key aspects needed for <span style = "color:salmon;"> casual riders </span> to buy an annual membership according to data insights. Use key conclusions to develop new and effective marketing strategies in order to convert <span style = "color:salmon;"> casual riders </span> to <span style = "color:cornflowerblue;"> annual members </span> and present the strategies.
##### Questions to answer:
1. What are the differences between the two rider groups?
2. What is needed to transform <span style = "color:salmon;"> casual riders </span> into <span style = "color:cornflowerblue;"> annual members </span>?
3. What are some strategies to implement?

In [None]:
library(tidyverse) # Load needed libraries
library(lubridate)
library(ggplot2)
library(plyr)
library(dplyr)

I added the datasets from Kaggle, so I changed the working directory to read the files and changed it back afterwards so that the rest of the analysis runs from the working folder. I saved all the file names to a list, then read each file and appended them to form one data frame containing all the data.

In [None]:
setwd("/kaggle/input/")                     # Change directory to read dataset added
mydir = "cyclistic-trips-202108-to-202207"
myfiles = list.files(path = mydir, pattern = "\\.csv$", full.names = TRUE) 
                                            # Create a list of file names
data = ldply(myfiles, read_csv)             # Append datasets into one
setwd('/kaggle/working')                    # Change directory back
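The same read-and-append step can be sketched with base R only, without `plyr`. The two tiny CSV files below are made up for illustration and written to a temporary folder; they are not the Cyclistic data, and the names `tmp_dir`, `demo_files`, and `demo_data` are hypothetical.

```r
# Hypothetical example: two tiny CSV files in a temporary folder
tmp_dir <- file.path(tempdir(), "csv_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a", "b"), minutes = c(5, 12)),
          file.path(tmp_dir, "part1.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = "c", minutes = 30),
          file.path(tmp_dir, "part2.csv"), row.names = FALSE)

# List every file ending in .csv, read each one, and stack the rows
demo_files <- list.files(path = tmp_dir, pattern = "\\.csv$", full.names = TRUE)
demo_data <- do.call(rbind, lapply(demo_files, read.csv))
nrow(demo_data) # 3 rows: two from the first file and one from the second
```

Note that `pattern` in `list.files` is a regular expression, so `"\\.csv$"` matches names that end in `.csv`.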

I used the function `setNames` to set key/value pairs that map each month to its season.

In [None]:
months = setNames(c(rep("Winter", 2), rep("Spring", 3), 
                    rep("Summer", 3), rep("Fall", 3), "Winter"), month.name)
months # Output a map that matches months to the correct season
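As a small illustration of how this kind of map is used, a named vector can be indexed by name to look up values. The sketch below rebuilds the lookup under the hypothetical name `season_lookup` so that it runs on its own.

```r
# Rebuild the month-to-season lookup so this sketch runs on its own
season_lookup <- setNames(c(rep("Winter", 2), rep("Spring", 3),
                            rep("Summer", 3), rep("Fall", 3), "Winter"),
                          month.name)

# Index by name to look up seasons, one result per requested month
picked <- season_lookup[c("January", "July", "December")]
unname(picked) # "Winter" "Summer" "Winter"
```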

1. I added a few more columns using the `mutate` function in order to perform the analysis:
* `trip_duration` - the time difference between the starting time and the ending time
* `day_of_week` - the day of the week of the trip, found with the `weekdays` function
* `month` - the name of the month of the trip, looked up in the built-in `month.name` vector
* `season` - the season of the trip, looked up in the map created previously

2. I filtered out trips whose duration is less than or equal to 0, because that is an invalid trip time and I assume it comes from a mistake in the recording of the data.

3. I arranged the trips by duration in descending order to get a better look at the data.

4. I looked at the head of the new data frame to check that everything looks correct.

In [None]:
tripdata = data %>%                                        # Give df new name
    mutate(trip_duration = difftime(ended_at, started_at), # Calculate trip duration
           day_of_week = weekdays(started_at),             # Find day of week
           month = month.name[month(started_at)],          # Find name of month
           season = months[month]) %>%                     # Find name of season
    filter(trip_duration > 0) %>%                          # Make sure time is positive
    arrange(desc(trip_duration))                           # Order by trip duration
head(tripdata)                                             # Look at df
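The steps above can be sketched on a toy data frame with base R. The timestamps are made up, the names `toy` and `valid` are hypothetical, and `units = "mins"` is an assumption added here: the notebook leaves the units to `difftime`'s automatic choice.

```r
# Hypothetical trips; the timestamps are made up for illustration
toy <- data.frame(
    started_at = as.POSIXct(c("2022-07-04 08:00:00", "2022-01-15 17:30:00"), tz = "UTC"),
    ended_at   = as.POSIXct(c("2022-07-04 08:25:00", "2022-01-15 17:20:00"), tz = "UTC")
)

# Duration in minutes; the second trip ends before it starts, so it is negative
toy$trip_duration <- difftime(toy$ended_at, toy$started_at, units = "mins")
# Month name taken from the month number (1 to 12) of the start time
toy$month <- month.name[as.POSIXlt(toy$started_at)$mon + 1]

# Keep only trips with a positive duration
valid <- toy[toy$trip_duration > 0, ]
valid$month # "July"
```

Stating the units explicitly keeps the numbers comparable across runs, since the automatic choice depends on the data.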

In [None]:
colors = c("salmon", "cornflowerblue") # Choose colors for graphs
green = "darkseagreen"

This graph shows two boxplots of different colors, one for each rider group. <span style = "color:salmon;"> Salmon </span> is casual riders and <span style = "color:cornflowerblue;"> blue </span> is annual members.

* There is an enormous number of outliers; they have to be removed in order to see the statistics better.

In [None]:
ggplot(tripdata, aes(y = trip_duration, x = member_casual, color = member_casual)) + 
    geom_boxplot() + 
    scale_color_manual(values = colors)

After graphing again with the outliers hidden, we can say a few things about the trip duration of the <span style = "color:salmon;"> casual riders </span> group:
* Higher median and wider range
* From the previous graph: there are a lot of outliers
* The means, shown by the diamonds, are much higher than the medians, which means that the data is strongly right-skewed (this is also true for the <span style = "color:cornflowerblue;"> annual members </span> group)

In [None]:
ggplot(tripdata, aes(y = trip_duration, x = member_casual, color = member_casual)) + 
    geom_boxplot(outlier.shape = NA) + 
    coord_cartesian(ylim = quantile(tripdata$trip_duration, c(0.1, 0.9))) + 
    stat_summary(fun = mean, geom = "point", shape = 23, size = 10) + # `fun` replaces the deprecated `fun.y`
    scale_color_manual(values = colors)
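To see why a mean far above the median signals skew, here is a minimal sketch on made-up durations (the vector `durations` is hypothetical and not taken from the dataset).

```r
# Hypothetical trip durations in minutes, with one very long trip
durations <- c(5, 7, 8, 10, 12, 15, 600)

median(durations) # 10, the middle value, unaffected by the long trip
mean(durations)   # about 93.9, pulled upward by the single long trip
```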

After looking at the summary statistics using boxplots, I want to see if the outliers from the <span style = "color:salmon;"> casual riders </span> group are systematic by looking at each month.
* From the graph below, we can tell that the outliers are systematic because there is an exceptional number of outliers in every month.

In [None]:
ggplot(tripdata, aes(y = trip_duration, x = member_casual, color = member_casual)) + 
    geom_boxplot() +
    facet_wrap(~month) + 
    scale_color_manual(values = colors)

Looking at the mean, median, and maximum of the trip durations, I can tell that the mean is much higher than the median, so I am more certain that I should remove the outliers before conducting further analysis.

In [None]:
summary = tripdata %>%                              # One-row table of summary statistics
    summarise(mean_duration = mean(trip_duration), 
              median_duration = median(trip_duration), 
              max_duration = max(trip_duration))
summary

Since I have already removed the lower end of the data by dropping the trip durations that are lower than or equal to 0, and the summaries show a large number of high outliers, I decided to remove only the higher outliers.

In [None]:
q3 = quantile(tripdata$trip_duration, probs = 0.75) # Find third quartile
iqr = IQR(tripdata$trip_duration)                   # Find interquartile range
upper = q3 + 1.5 * iqr                              # Calculate upper bound
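As a worked example of the upper-bound rule (third quartile plus 1.5 times the interquartile range), here is a minimal sketch on made-up durations. The names `toy_durations`, `spread`, and `upper_fence` are hypothetical.

```r
# Hypothetical trip durations in minutes, with one very long trip
toy_durations <- c(5, 7, 8, 10, 12, 15, 600)

q3 <- quantile(toy_durations, probs = 0.75, names = FALSE) # third quartile: 13.5
spread <- IQR(toy_durations)                               # Q3 - Q1 = 13.5 - 7.5 = 6
upper_fence <- q3 + 1.5 * spread                           # 13.5 + 9 = 22.5

toy_durations[toy_durations < upper_fence] # the 600-minute trip is dropped
```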

Then I filtered the data to remove the outliers above the upper bound, and looked at the first few rows to make sure that the process was done correctly.

In [None]:
new_tripdata = tripdata %>%       # Name new df
    filter(trip_duration < upper) # Filter out outliers using upper bound
head(new_tripdata)                # Look at first few rows

I graphed the box plots again after removing the outliers. The box plots now look much more regular, while the basic summaries and the differences between the two groups are preserved, which makes the dataset easier to work with.
* <span style = "color:salmon;"> Casual riders </span> maintain a higher median and wider range and <span style = "color:cornflowerblue;"> annual members </span> maintain a tighter box
* The data is much less skewed than before, because the means and medians are now close to one another

In [None]:
ggplot(new_tripdata, aes(y = trip_duration, x = member_casual, color = member_casual)) + 
    geom_boxplot() + 
    stat_summary(fun = mean, geom = "point", shape = 23, size = 10) + # `fun` replaces the deprecated `fun.y`
    scale_color_manual(values = colors)

Looking further at the data for each month, I am more certain that the general trend is preserved.

In [None]:
ggplot(new_tripdata, aes(y = trip_duration, x = member_casual, color = member_casual)) + 
    geom_boxplot() +
    facet_wrap(~month) + 
    scale_color_manual(values = colors)

I have to detach the `plyr` package and rely on `dplyr`, because `plyr` was loaded last and masks `dplyr` functions such as `summarise` and `count`, which then ignore the groups created by `group_by`. This next step is therefore required for the grouped summaries to work correctly.

In [None]:
detach(package:plyr)    
library(dplyr)

I decided to plot line graphs instead of bar graphs because I observed that <span style = "color:salmon;"> casual riders </span> usually have higher trip durations and lower counts while <span style = "color:cornflowerblue;"> annual members </span> usually have lower trip durations and higher counts. Bar graphs would mostly show that one group has distinctly higher or lower values than the other. Line graphs are more useful for comparing the two groups because they show how the values change across the days of the week.
<br> <br>
First I created a list of days of the week so that later I can reorder the days so that the order makes sense.

In [None]:
week_days = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")

This line graph has days of the week on the x-axis and trip counts on the y-axis. Each dot shows the trip count for one day of the week. The days are ordered from Sunday to Saturday in order to see how the counts change throughout the week.

* The general trend is that there are higher counts on weekends and lower counts on weekdays, with an upward trend from Monday to Thursday

In [None]:
days = new_tripdata %>%
    group_by(day_of_week) %>% 
    count()

days$day_of_week = factor(days$day_of_week, levels = week_days) # Set order of days
days = days[order(days$day_of_week), ]                          # Reorder days of week

ggplot(days, aes(day_of_week, n, group = 1)) + 
    geom_line(color = green, size = 1.5) + geom_point(color = green, size = 3)
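The reordering step relies on factor levels: once the day names are a factor with levels in calendar order, `order` sorts by that order instead of alphabetically. A minimal sketch with made-up counts (the names `day_order` and `toy_counts` are hypothetical):

```r
day_order <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")

# Hypothetical counts, listed alphabetically as a grouped summary would return them
toy_counts <- data.frame(day_of_week = c("Friday", "Monday", "Sunday"),
                         n = c(30, 20, 50))

# Turn the day names into a factor whose levels follow the calendar week
toy_counts$day_of_week <- factor(toy_counts$day_of_week, levels = day_order)
toy_counts <- toy_counts[order(toy_counts$day_of_week), ]
as.character(toy_counts$day_of_week) # "Sunday" "Monday" "Friday"
```

ggplot2 also uses the factor levels to place the days along the x-axis, which is why the lines read from Sunday to Saturday.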

The graph below differs from the one above only in that its y-axis shows the average trip duration.

* The general trend is that there are higher durations on weekends and lower durations on weekdays. The shape is an upward-facing parabola with Tuesday as the minimum

In [None]:
days = new_tripdata %>%
    group_by(day_of_week) %>% 
    summarise(avg_time = mean(trip_duration))

days$day_of_week = factor(days$day_of_week, levels = week_days)
days = days[order(days$day_of_week), ]

ggplot(days, aes(day_of_week, avg_time, group = 1)) + 
    geom_line(color = green, size = 1.5) + geom_point(color = green, size = 3)

The two rider groups have different and almost completely opposite trends.
* <span style = "color:salmon;"> Casual riders </span> have much higher counts on weekends and an upward trend across the weekdays, giving an upward-facing parabola with a minimum on Tuesday.
* <span style = "color:cornflowerblue;"> Annual members </span> have their lowest counts on weekends, slightly lower counts on the days next to the weekend, and stable high counts from Tuesday to Thursday, which gives an upside-down parabola with a flat peak.

In [None]:
rider_days = new_tripdata %>%
    group_by(day_of_week, member_casual) %>% 
    count()

rider_days$day_of_week = factor(rider_days$day_of_week, levels = week_days)
rider_days = rider_days[order(rider_days$day_of_week), ]

ggplot(rider_days, aes(day_of_week, n, group = member_casual, color = member_casual)) + 
    geom_line(size = 1.5) + geom_point(size = 3) + 
    scale_color_manual(values = colors)

Now looking at the average trip duration:
* The two groups have a similar trend of upward-facing parabolas, with higher durations on weekends and minimums on Tuesday
* The <span style = "color:salmon;"> casual riders </span> group has much higher trip durations regardless of the day

In [None]:
rider_days = new_tripdata %>%
    group_by(day_of_week, member_casual) %>% 
    summarise(avg_time = mean(trip_duration))

rider_days$day_of_week = factor(rider_days$day_of_week, levels = week_days)
rider_days = rider_days[order(rider_days$day_of_week), ]

ggplot(rider_days, aes(day_of_week, avg_time, group = member_casual, color = member_casual)) + 
    geom_line(size = 1.5) + geom_point(size = 3) + 
    scale_color_manual(values = colors)

Next, I decided to look at the overall data separated into months. To make the graph easier to read, I also put each season into a separate grid. Each color represents a month and each grid is a season.

* Counts are lowest in the winter months and highest in the summer months, while fall and spring have moderate counts
* Summer and winter months tend to have less variation among the days of the week compared to fall and spring months.

In [None]:
month_colors = c("slateblue2", "lightblue", # Set colors for each month
                 "mediumturquoise", "mediumspringgreen", "yellowgreen", 
                 "tomato", "orange", "gold", 
                 "lightslategrey", "chocolate", "violet", 
                 "mediumpurple")

In [None]:
seasons = new_tripdata %>%
    group_by(day_of_week, month, season) %>% 
    count()

seasons$day_of_week = factor(seasons$day_of_week, levels = week_days)
seasons$month = factor(seasons$month, levels = month.name)

seasons = seasons[order(seasons$day_of_week), ]
seasons = seasons[order(seasons$month), ]

ggplot(seasons, aes(day_of_week, n, group = month, color = month)) + 
    geom_line(size = 1) + geom_point(size = 1.2) + facet_wrap(~season) + 
    theme(axis.text.x = element_text(angle = 90)) + 
    scale_color_manual(values = month_colors)

Now looking at the average trip durations, we can see similar trends to the graph above on the counts.
* Winter tends to have lower durations and summer tends to have higher durations, while fall and spring have moderate durations
* Each season has similar trends among its months
* Most months show upward parabola shapes, with winter having flatter lines
<br>
<br>
Interesting point: In March, Wednesday has a much higher average than the days next to it

In [None]:
seasons = new_tripdata %>%
  group_by(day_of_week, month, season) %>% 
  summarise(avg_time = mean(trip_duration))

seasons$day_of_week = factor(seasons$day_of_week, levels = week_days)
seasons$month = factor(seasons$month, levels = month.name)

seasons = seasons[order(seasons$day_of_week), ]
seasons = seasons[order(seasons$month), ]

ggplot(seasons, aes(day_of_week, avg_time, group = month, color = month)) + 
    geom_line(size = 1) + geom_point(size = 1.2) + facet_wrap(~season) + 
    theme(axis.text.x = element_text(angle = 90)) + 
    scale_color_manual(values = month_colors)

Next, I looked at line graphs showing the same counts, but this time separated by rider type. The top row of the grid shows <span style = "color:salmon;"> casual riders </span> while the bottom shows <span style = "color:cornflowerblue;"> annual members </span>.
* Although the two rider groups have different trends within the week, they have similar overall counts for each month.

In [None]:
rider_seasons = new_tripdata %>%
    group_by(day_of_week, month, season, member_casual) %>% 
    count()

rider_seasons$day_of_week = factor(rider_seasons$day_of_week, levels = week_days)
rider_seasons$month = factor(rider_seasons$month, levels = month.name)

rider_seasons = rider_seasons[order(rider_seasons$day_of_week), ]
rider_seasons = rider_seasons[order(rider_seasons$month), ]

ggplot(rider_seasons, aes(day_of_week, n, group = month, color = month)) + 
    geom_line(size = 1) + geom_point(size = 1.2) + 
    facet_wrap(member_casual~season, ncol = 4) + 
    theme(axis.text.x = element_text(angle = 90)) + 
    scale_color_manual(values = month_colors)

Next, I created grid plots for each season with separate lines for each rider type to observe the overall trends in each season by rider type. Each graph in the grid is a season and each color is a rider type.

* There are similar trends throughout each season within the same rider type. <span style = "color:salmon;"> Casual riders </span> show upward-facing parabolas with flatter minimums while <span style = "color:cornflowerblue;"> annual members </span> show downward-facing parabolas with sharper maximums
* <span style = "color:cornflowerblue;"> Annual members </span> have higher or equal counts except for summer when <span style = "color:salmon;"> casual riders </span> show higher counts on weekends
* There are clear parabola patterns except for winter when the counts are almost stable throughout the week

In [None]:
rider_only_seasons = new_tripdata %>%
    group_by(day_of_week, season, member_casual) %>% 
    count()

rider_only_seasons$day_of_week = factor(rider_only_seasons$day_of_week, levels = week_days)
rider_only_seasons = rider_only_seasons[order(rider_only_seasons$day_of_week), ] %>% # Order by this data frame's own column
    na.omit()                                                                         # Drop any rows with missing values

ggplot(rider_only_seasons, aes(day_of_week, n, group = member_casual, color = member_casual)) + 
    geom_line(size = 1) + geom_point(size = 1.2) + facet_wrap(~season) + 
    theme(axis.text.x = element_text(angle = 90)) + 
    scale_color_manual(values = colors)

Next I decided to look at the geographical locations of the trips to determine areas that include popular stations.
<br>
First I removed the single outlier, then I sampled the data to 1000 random points so that the graph would not be packed.

In [None]:
remove_outlier = new_tripdata %>%
    filter(start_lat < 42.5)
rand_samp = remove_outlier[sample(nrow(remove_outlier), size = 1000), ]
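The notebook does not fix a random seed, so the sampled points change on every run. A minimal sketch of making such a sample repeatable follows; the seed value 42 and the toy data frame `toy_trips` are assumptions for illustration and do not appear in the original analysis.

```r
# Hypothetical data frame standing in for the trip data
toy_trips <- data.frame(id = 1:100,
                        start_lat = seq(41.7, 42.0, length.out = 100))

set.seed(42) # Assumed seed; any fixed number gives a repeatable sample
first_draw <- toy_trips[sample(nrow(toy_trips), size = 10), ]

set.seed(42) # Resetting the seed reproduces the same rows
second_draw <- toy_trips[sample(nrow(toy_trips), size = 10), ]
identical(first_draw, second_draw) # TRUE
```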

The graph below shows the geographical location of start stations. The x-axis is the latitude and the y-axis is the longitude. Each rider group is shown in the same color as before. A larger dot means that more of the sampled trips start at that location, so the station is more popular, while a smaller dot means the opposite.

In [None]:
ggplot(rand_samp, aes(x = start_lat, y = start_lng, color = member_casual)) +
    geom_count(alpha = 0.3) + 
    scale_color_manual(values = colors) +
    annotate("rect", xmin = 41.86, xmax = 41.91, ymin = -87.6275, ymax = -87.605, 
             fill = NA, color = "mediumorchid") # Draw the highlight box once

The graph below shows the same information for end stations.
* In both plots I have highlighted an area in <span style = "color:mediumorchid;"> purple </span>; this area seems to be more popular among <span style = "color:salmon;"> casual riders </span>.

In [None]:
ggplot(rand_samp, aes(x = end_lat, y = end_lng, color = member_casual)) +
    geom_count(alpha = 0.3) + 
    scale_color_manual(values = colors) +
    annotate("rect", xmin = 41.86, xmax = 41.91, ymin = -87.6275, ymax = -87.605, 
             fill = NA, color = "mediumorchid") # Draw the highlight box once

Some of the popular stations are found below.

In [None]:
popular_stations = remove_outlier %>%
    group_by(start_station_name, member_casual) %>%
    count() %>%
    arrange(desc(n)) %>%
    na.omit()
head(popular_stations)

### Conclusion
These trends suggest that <span style = "color:cornflowerblue;"> annual members </span> are mostly people who use the service to make short commutes to work or school on a regular basis, which would explain why they take more trips with lower trip durations, and why their trip durations and counts are more constant on weekdays. On the other hand, <span style = "color:salmon;"> casual riders </span> appear to be people who like to take trips on weekends, maybe for exercise or to meet with others occasionally. They probably have other ways to commute to work or school on a regular basis, so they use the service less often on weekdays.
<br>
This story answers the original question about the differences between the two types of riders, with the main insight being that <span style = "color:cornflowerblue;"> annual members </span> use the service for work on weekdays while <span style = "color:salmon;"> casual riders </span> use the service for leisure activities on weekends.

##### Some points that would make casual riders consider buying annual membership:
1. Switching to this bike share program to commute to work or school on a regular basis
2. Making the annual membership worth it even when it is only used occasionally for leisure activities
3. More convenience or better prices/deals

##### Possible strategies:
1. **Adding seasonal/monthly memberships:** Due to the different trends in different seasons, customers might be drawn to try a seasonal membership if they know they are likely to use the program more during a certain season, such as summer. Monthly memberships could be useful for customers who want to try commuting to work/school using the bike share program, giving them a test period and allowing flexibility.
2. **Incorporating a rewards system:** A rewards system that provides certain discounts/deals could increase customers' return rates. Customers could collect points each time they use the service, increasing their sense of engagement. This system would encourage customers to use the service more often, which would make the purchase of an annual membership more likely.
3. **Adding extra stations:** Adding stations near spots that are popular among <span style = "color:salmon;"> casual riders </span> (see the `popular_stations` data frame above) could also increase their use of the service, which might turn them into <span style = "color:cornflowerblue;"> annual members </span>. Areas such as the <span style = "color:mediumorchid;"> highlighted </span> area in the scatter plots above could be good places to add stations, increasing convenience for both groups of riders.