# Setting up environment

In [None]:
install.packages('tidyverse')

In [None]:
install.packages('lubridate')
install.packages('geosphere')

In [None]:
library('tidyverse')
library("ggplot2")
library("lubridate")
library("geosphere")

# Loading dataset

In [None]:
df1 <- read.csv("../input/cyclist/data/202009-divvy-tripdata.csv")
df2 <- read.csv("../input/cyclist/data/202010-divvy-tripdata.csv")
df3 <- read.csv("../input/cyclist/data/202011-divvy-tripdata.csv")
df4 <- read.csv("../input/cyclist/data/202012-divvy-tripdata.csv")
df5 <- read.csv("../input/cyclist/data/202101-divvy-tripdata.csv")
df6 <- read.csv("../input/cyclist/data/202102-divvy-tripdata.csv")
df7 <- read.csv("../input/cyclist/data/202103-divvy-tripdata.csv")
df8 <- read.csv("../input/cyclist/data/202104-divvy-tripdata.csv")
df9 <- read.csv("../input/cyclist/data/202105-divvy-tripdata.csv")
df10 <- read.csv("../input/cyclist/data/202106-divvy-tripdata.csv")
df11 <- read.csv("../input/cyclist/data/202107-divvy-tripdata.csv")
df12 <- read.csv("../input/cyclist/data/202108-divvy-tripdata.csv")


In [None]:
bike_rides <- rbind(df1, df2, df3, df4, df5, df6, df7, df8, df9, df10, df11, df12)
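The twelve `read.csv` calls can also be collapsed into a loop. The following is a minimal sketch, assuming every monthly file has the same columns. It writes two tiny stand-in files (made-up names and values) to a temporary folder so that it runs anywhere; the same pattern should carry over to `../input/cyclist/data`.

```r
# Create a temporary folder with two small stand-in CSV files (made-up data)
data_dir <- file.path(tempdir(), "divvy_demo")
dir.create(data_dir, showWarnings = FALSE)
write.csv(data.frame(ride_id = c("a1", "a2"), member_casual = c("member", "casual")),
          file.path(data_dir, "202009-divvy-tripdata.csv"), row.names = FALSE)
write.csv(data.frame(ride_id = c("b1", "b2"), member_casual = c("casual", "member")),
          file.path(data_dir, "202010-divvy-tripdata.csv"), row.names = FALSE)

# List every monthly file, read each one, and stack them row-wise
files <- list.files(data_dir, pattern = "-divvy-tripdata\\.csv$", full.names = TRUE)
monthly <- lapply(files, read.csv)
combined <- do.call(rbind, monthly)

dim(combined)
```

Adding a new month then only requires dropping the file into the folder, with no new `dfN` variable.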

In [None]:
dim(bike_rides)

Before diving into the analysis, the first thing I want to do is clean the data, starting with removing duplicate rides.

# Preprocessing and cleaning the data

In [None]:
summary(bike_rides)

The started_at and ended_at columns are stored as strings, so to do calculations on them I convert both to date-time format. I also add a column for the date of the trip (Ymd) and columns for the start and end hour, taken from the hour part of started_at and ended_at.

In [None]:
bike_rides$Ymd  <- as.Date(bike_rides$started_at)

bike_rides$started_at <- lubridate::ymd_hms(bike_rides$started_at)
bike_rides$ended_at <- lubridate::ymd_hms(bike_rides$ended_at)

bike_rides$start_hour <- lubridate::hour(bike_rides$started_at)
bike_rides$end_hour <- lubridate::hour(bike_rides$ended_at)

In [None]:
cyclistic <- bike_rides[!duplicated(bike_rides$ride_id), ]
print(paste("Removed", nrow(bike_rides) - nrow(cyclistic), "duplicated rows"))

I removed the duplicated rows, and now I want to remove the rows containing NA values and compute the ride time in minutes.

In [None]:
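cyclistic <- drop_na(cyclistic)
# Ride duration in minutes; units are requested explicitly because plain subtraction lets R choose them
cyclistic$ride_time <- as.numeric(difftime(cyclistic$ended_at, cyclistic$started_at, units = "mins"))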
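One thing to watch with `drop_na`: by default `read.csv` turns blank fields into `NA` only in numeric columns, while blank fields in text columns (for example a missing station name) stay as empty strings and are not dropped. The sketch below uses made-up inline data and base R's `complete.cases` as a stand-in for `drop_na`; it assumes you would want blanks treated as missing, which is a choice rather than something the original analysis did.

```r
# Made-up CSV: row b has a blank station name, row c has a blank latitude
csv_text <- "ride_id,start_station_name,end_lat
a,Clark St,41.9
b,,41.8
c,State St,
"

# Default read: the blank station name is kept as an empty string
default_read <- read.csv(text = csv_text)

# Treat empty fields as NA in every column
blank_as_na <- read.csv(text = csv_text, na.strings = c("", "NA"))

# Number of rows that would survive dropping NAs in each case
sum(complete.cases(default_read))
sum(complete.cases(blank_as_na))
```

Passing `na.strings = c("", "NA")` when loading the monthly files would make the later `drop_na(cyclistic)` remove rides with missing station names as well.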

In [None]:
head(cyclistic)

In [None]:
summary(cyclistic$ride_time)
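A note on units: subtracting two date-time vectors returns a `difftime` whose units R picks automatically from the smallest difference in the data, so the same expression can yield seconds for one dataset and minutes for another. The sketch below, using made-up timestamps, shows the effect and the explicit `units = "mins"` form that avoids it.

```r
start <- as.POSIXct("2021-01-01 10:00:00", tz = "UTC")

# One ride of 30 seconds and one of 10 minutes
short_rides <- start + c(30, 600)
# Rides of 10 and 20 minutes
long_rides <- start + c(600, 1200)

# Units chosen automatically differ between the two vectors
units(short_rides - start)
units(long_rides - start)

# Asking for minutes explicitly gives the same scale every time
as.numeric(difftime(short_rides, start, units = "mins"))
as.numeric(difftime(long_rides, start, units = "mins"))
```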

There are some negative values in the ride time, which does not make sense logically, so I will drop them (along with zero-length rides) to avoid confusion.

In [None]:
cyclistic <- cyclistic %>% filter(ride_time > 0)

In [None]:
cyclistic$day_of_week <- format(as.Date(cyclistic$Ymd), "%A")

Now I want to calculate ride distance and ride speed. To do that I will use the start and end latitude and longitude coordinates along with the ride time (speed = distance / time).

In [None]:
cyclistic$ride_distance <- distGeo(matrix(c(cyclistic$start_lng, cyclistic$start_lat), ncol = 2), matrix(c(cyclistic$end_lng, cyclistic$end_lat), ncol = 2))
cyclistic$ride_distance <- cyclistic$ride_distance/1000

# ride_distance is in km and ride_time in minutes, so km per minute * 60 gives km/h
cyclistic$ride_speed <- cyclistic$ride_distance / cyclistic$ride_time * 60

cyclistic$month <- strftime(cyclistic$started_at, "%m")
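To make the unit handling concrete, here is a small worked sketch of distance and speed for a single made-up ride. It uses a hand-written haversine formula on a sphere of radius 6371 km as a base-R stand-in for `distGeo` (which works on an ellipsoid, so the two will differ slightly); `haversine_km` and `speed_kmh` are hypothetical helper names, not part of geosphere.

```r
# Great-circle distance in km between two points (haversine, sphere of radius 6371 km)
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
    to_rad <- pi / 180
    dlat <- (lat2 - lat1) * to_rad
    dlng <- (lng2 - lng1) * to_rad
    a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
    2 * radius_km * asin(sqrt(a))
}

# Speed in km/h from a distance in km and a duration in minutes
speed_kmh <- function(distance_km, minutes) distance_km / minutes * 60

# Made-up ride: two points roughly in Chicago, ridden in 12 minutes
d <- haversine_km(-87.65, 41.90, -87.63, 41.88)
d
speed_kmh(d, 12)
```

Because the distance is a straight line between the start and end points, round trips that end where they started get a distance and speed of zero.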

In [None]:
ride_count_start_station <- cyclistic %>%
    group_by(start_station_name) %>% 
    summarise(ride_count = length(start_station_id))
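The station counts are more useful once sorted. Below is a minimal sketch on toy data (the station names are made up) that counts rides per start station, skips rides with a blank station name, and keeps the busiest stations; it assumes dplyr 1.0 or later for `slice_head`.

```r
library(dplyr)

# Toy data standing in for cyclistic (station names are made up)
rides <- data.frame(
    start_station_name = c("A St", "B Ave", "A St", "", "A St", "B Ave", "C Rd"),
    stringsAsFactors = FALSE
)

top_stations <- rides %>%
    filter(start_station_name != "") %>%        # skip rides with no recorded station
    count(start_station_name, sort = TRUE) %>%  # one row per station, busiest first
    slice_head(n = 2)                           # keep the two busiest stations

top_stations
```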

In [None]:
cyclistic %>%
  write.csv("cyclistic_clean.csv")

# Analysing the data

In [None]:
ggplot(cyclistic, aes(member_casual, fill=member_casual)) +
    geom_bar() +
    labs(x="Casuals x Members", title="Casuals Vs Members distribution")

Members dominate in the ride count. Let's dig deeper with group_by to get the exact numbers.

In [None]:
cyclistic %>% 
    group_by(member_casual) %>% 
    summarise(count = length(ride_id))

In [None]:
# monthly report
ggplot(cyclistic, aes(month, fill=member_casual)) +
    geom_bar(position=position_dodge()) +
    labs(x="months", title="No of rides per month") +
    coord_flip()

In [None]:
# weekday report
ggplot(cyclistic, aes(day_of_week, fill=member_casual)) +
    geom_bar(position=position_dodge()) +
    labs(x="weekdays", title="No of rides on weekdays")

We can see that casual riders take more rides on Saturday and Sunday.
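Since `day_of_week` is a character column, ggplot2 orders the bars alphabetically (Friday first) rather than by calendar order. A minimal sketch of one way to fix the order, assuming an English locale so that `format(..., "%A")` produces these day names:

```r
# Calendar order for the weekday axis (assumes English day names)
day_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

# Made-up sample of day names as they would appear in day_of_week
days <- c("Sunday", "Monday", "Friday", "Monday")

# Converting to a factor with explicit levels fixes the ordering
days_factor <- factor(days, levels = day_levels)

sort(unique(days))   # alphabetical order of the raw strings
sort(days_factor)    # calendar order once levels are set
```

Applying the same `factor(..., levels = day_levels)` call to `cyclistic$day_of_week` before plotting should put the weekday bars in Monday to Sunday order.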

In [None]:
cyclistic %>%
    ggplot(aes(start_hour, fill=member_casual)) +
    labs(x="Hour of the day", title="No of rides by start hour") +
    geom_bar(position=position_dodge())

From the above graph we can conclude that most people start their ride around 5 pm. The afternoon hours dominate for the most part, while fewer people ride in the morning hours.

In [None]:
names(cyclistic)

In [None]:
new_df <- cyclistic %>% 
    group_by(member_casual) %>% 
    summarise(mean_time = mean(ride_time),mean_distance = mean(ride_distance))
new_df

In [None]:
new_df1 <- cyclistic %>% 
    group_by(member_casual) %>% 
    summarise(median_time = median(ride_time),median_distance = median(ride_distance))
new_df1

There is a large gap between the mean and median ride time, most likely because a few very long rides (outliers) pull the mean up. For now let's look at the mean, keeping in mind that the median is the more robust of the two.
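To see how outliers separate the two statistics, here is a small sketch with made-up ride times in minutes, where a single day-long ride is mixed in with ordinary short rides:

```r
# Four ordinary rides plus one ride left running for 24 hours (made-up values, minutes)
ride_minutes <- c(10, 12, 11, 9, 1440)

mean(ride_minutes)    # dragged far upward by the single long ride
median(ride_minutes)  # stays at the typical ride length

# Dropping the extreme ride brings the mean back in line with the median
mean(ride_minutes[ride_minutes < 1440])
```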

In [None]:
ggplot(new_df, aes(x=member_casual, y = mean_time, fill = member_casual)) +
    geom_col(position=position_dodge()) +
    labs(x="members_casual", title="Mean Time members vs casual")

In [None]:
ggplot(new_df, aes(x=member_casual, y = mean_distance, fill = member_casual)) +
    geom_col(position=position_dodge()) +
    labs(x="members_casual", title="Mean distance members vs casual")

In [None]:
names(cyclistic)

In [None]:
cyclistic %>%
    ggplot(aes(rideable_type, fill=member_casual)) +
    labs(x="rideable type", title="Distribution of rideable_type") +
    geom_bar(position=position_dodge())

From the visualizations above, we have gained some insights that will help us answer the business questions for our stakeholders. Both casual riders and members use our service heavily on weekends. Focusing on casual riders: they appear to use our bikes for leisure, and their rides are longer on average, even though casual rides are clearly outnumbered by member rides. Casual riders mostly use classic and electric bikes.

Annual members generally seem to use our service for work-related trips, which fits with an annual subscription being their go-to option. They also use classic and electric bikes to a good extent.

So we can attract casual riders and convert them to members by offering deals on classic and electric bikes and on weekend rides. We can also persuade casual riders to take an annual membership with coupons or offers tied to ride time, since the analysis shows that they ride longer; this may draw part of that crowd to membership.

We also know the popular start stations and the routes riders are taking, so we can place banners along those routes. We could also increase bike prices on weekends, since casual riders tend to use our service more then. Finally, we can give special perks to members, which would also help convert casual riders to members.