## Cyclistic biking service case study
## Walter Mudavanhu Nyamutamba
## Data Analyst
## 03/01/2023
### Case Study: How Does a Bike-Share Navigate Speedy Success?

# Introduction
This notebook is my solution to the Google Data Analytics Capstone case study "Cyclistic".

I will follow the six steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

Each step follows the Case Study Roadmap.

# Content

### About Dataset

This dataset comes from the historical trip files of the bike-sharing company. Each row records a single ride: start and end time, start and end station, bike type, and rider type (member or casual).

## Acknowledgements
Motivate International Inc. (“Motivate”) operates the City of Chicago’s (“City”) Divvy bicycle-sharing service. Motivate and the City are committed to supporting bicycling as an alternative transportation option. As part of that commitment, the City permits Motivate to make certain Divvy system data owned by the City (“Data”) available to the public.

## Goal of the project
Understand how casual riders and annual members use Cyclistic differently.
From these insights, the team will design a marketing strategy to convert casual riders into annual members.
##### How to maximize the number of annual members?



# ASK

Scenario

You are a junior data analyst working in the marketing analyst team at Cyclistic, a bike-share company in Chicago. The director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, your team wants to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, your team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve your recommendations, so they must be backed up with compelling data insights and professional data visualizations.


Characters and teams   

 Cyclistic: A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
 Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
 Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
 Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.


 About the company

 In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.

Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.


## Guiding questions
What is the problem you are trying to solve?

The main objective of this case study is to find ways to convert casual riders into annual members.

How can your insights drive business decisions?
The insights will help the team increase the number of annual members.

## This report also seeks to identify the important stakeholders that are involved in the overall analysis.

#### Stakeholders

_Cyclistic users_

_Director of marketing_

_Cyclistic marketing team_

_Cyclistic executive team_

# Prepare

This project will use the dataset provided on Kaggle.

### Guiding questions

● Where is your data located?

● How is the data organized?

● Are there issues with bias or credibility in this data? Does your data ROCCC?

● How are you addressing licensing, privacy, security, and accessibility?

● How did you verify the data’s integrity?

● How does it help you answer your question?

● Are there any problems with the data?

Key tasks

1. Download data and store it appropriately.

2. Identify how it’s organized.

3. Sort and filter the data.

4. Determine the credibility of the data.

Deliverable

A description of all data sources used

# Process
This step cleans the dataset and makes it ready for the next phase, Analyze. All the files will also be merged into one file.

## Load Packages

In [1]:
library(tidyverse)
# tidyverse: standard data analysis collection (data manipulation, exploration and visualisation)

library(dplyr)
# dplyr: common data manipulation verbs (also attached by tidyverse)

library(skimr)
# skimr: summary statistics for the data

library(readr)
# readr: CSV import (also attached by tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


# Import data
The data is for the year 2021

In [2]:
dat2021January <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202101-divvy-tripdata/202101-divvy-tripdata.csv')
dat2021February <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202102-divvy-tripdata/202102-divvy-tripdata.csv')
dat2021March <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202103-divvy-tripdata/202103-divvy-tripdata.csv')
dat2021April <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202104-divvy-tripdata/202104-divvy-tripdata.csv')
dat2021May <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202105-divvy-tripdata/202105-divvy-tripdata.csv')
dat2021June <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202106-divvy-tripdata/202106-divvy-tripdata.csv')
dat2021July <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202107-divvy-tripdata/202107-divvy-tripdata.csv')
dat2021August <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202108-divvy-tripdata/202108-divvy-tripdata.csv')
dat2021September <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202109-divvy-tripdata/202109-divvy-tripdata.csv')
dat2021October <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202110-divvy-tripdata/202110-divvy-tripdata.csv')
dat2021November <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202111-divvy-tripdata/202111-divvy-tripdata.csv')
dat2021December <- read.csv('/kaggle/input/cyclistic-case-study-google-certificate/202112-divvy-tripdata/202112-divvy-tripdata.csv')
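
The twelve `read.csv()` calls differ only in the month part of the path, so they could also be written as a loop. The sketch below is a minimal illustration, assuming every file follows the `2021MM-divvy-tripdata` folder and file naming seen in the paths above. The helper name `read_monthly_trips` and its arguments are hypothetical and not part of the original notebook.

```r
# Hypothetical helper: read the monthly trip files under `base_dir`
# and return them as a named list of data frames.
read_monthly_trips <- function(base_dir, year = 2021, months = 1:12) {
  stems <- sprintf("%d%02d-divvy-tripdata", year, months)
  # Each CSV sits in a folder of the same name, as in the paths above
  paths <- file.path(base_dir, stems, paste0(stems, ".csv"))
  missing_files <- paths[!file.exists(paths)]
  if (length(missing_files) > 0) {
    stop("Missing files: ", paste(missing_files, collapse = ", "))
  }
  trips <- lapply(paths, read.csv)
  names(trips) <- stems
  trips
}

# Usage in the Kaggle environment (base path taken from the cells above):
# monthly <- read_monthly_trips("/kaggle/input/cyclistic-case-study-google-certificate")
```

Keeping the months in a list makes later steps, such as comparing schemas or binding rows, a single call instead of twelve.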

## Getting to know your data
First, check whether the 12 datasets have matching fields and data types.
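
Reading twelve `str()` printouts by eye is error-prone, so the same check can also be done programmatically. The base R sketch below compares the column names and classes of several data frames against the first one. `same_schema` is a hypothetical helper, and the two small data frames are made-up stand-ins for the monthly files.

```r
# Hypothetical helper: TRUE if every data frame has the same column
# names (in the same order) and the same column classes as the first.
same_schema <- function(frames) {
  schema <- function(df) vapply(df, function(col) class(col)[1], character(1))
  reference <- schema(frames[[1]])
  all(vapply(frames, function(df) identical(schema(df), reference), logical(1)))
}

# Made-up stand-ins for two monthly data frames
jan <- data.frame(ride_id = "A1", start_lat = 41.9, stringsAsFactors = FALSE)
feb <- data.frame(ride_id = "B2", start_lat = 42.0, stringsAsFactors = FALSE)

same_schema(list(jan, feb))
```

With the real data, the call would take the twelve monthly data frames wrapped in `list(...)`.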

In [3]:
str(dat2021January)

'data.frame':	96834 obs. of  13 variables:
 $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
 $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.7 -87

In [4]:
str(dat2021February)

'data.frame':	49622 obs. of  13 variables:
 $ ride_id           : chr  "89E7AA6C29227EFF" "0FEFDE2603568365" "E6159D746B2DBB91" "B32D3199F1C2E75B" ...
 $ rideable_type     : chr  "classic_bike" "classic_bike" "electric_bike" "classic_bike" ...
 $ started_at        : chr  "2021-02-12 16:14:56" "2021-02-14 17:52:38" "2021-02-09 19:10:18" "2021-02-02 17:49:41" ...
 $ ended_at          : chr  "2021-02-12 16:21:43" "2021-02-14 18:12:09" "2021-02-09 19:19:10" "2021-02-02 17:54:06" ...
 $ start_station_name: chr  "Glenwood Ave & Touhy Ave" "Glenwood Ave & Touhy Ave" "Clark St & Lake St" "Wood St & Chicago Ave" ...
 $ start_station_id  : chr  "525" "525" "KA1503000012" "637" ...
 $ end_station_name  : chr  "Sheridan Rd & Columbia Ave" "Bosworth Ave & Howard St" "State St & Randolph St" "Honore St & Division St" ...
 $ end_station_id    : chr  "660" "16806" "TA1305000029" "TA1305000034" ...
 $ start_lat         : num  42 42 41.9 41.9 41.8 ...
 $ start_lng         : num  -87.7 -87.7 -87.6 -87.7 

In [5]:
str(dat2021March)

'data.frame':	228496 obs. of  13 variables:
 $ ride_id           : chr  "CFA86D4455AA1030" "30D9DC61227D1AF3" "846D87A15682A284" "994D05AA75A168F2" ...
 $ rideable_type     : chr  "classic_bike" "classic_bike" "classic_bike" "classic_bike" ...
 $ started_at        : chr  "2021-03-16 08:32:30" "2021-03-28 01:26:28" "2021-03-11 21:17:29" "2021-03-11 13:26:42" ...
 $ ended_at          : chr  "2021-03-16 08:36:34" "2021-03-28 01:36:55" "2021-03-11 21:33:53" "2021-03-11 13:55:41" ...
 $ start_station_name: chr  "Humboldt Blvd & Armitage Ave" "Humboldt Blvd & Armitage Ave" "Shields Ave & 28th Pl" "Winthrop Ave & Lawrence Ave" ...
 $ start_station_id  : chr  "15651" "15651" "15443" "TA1308000021" ...
 $ end_station_name  : chr  "Stave St & Armitage Ave" "Central Park Ave & Bloomingdale Ave" "Halsted St & 35th St" "Broadway & Sheridan Rd" ...
 $ end_station_id    : chr  "13266" "18017" "TA1308000043" "13323" ...
 $ start_lat         : num  41.9 41.9 41.8 42 42 ...
 $ start_lng         : num  -

In [6]:
str(dat2021April)

'data.frame':	337230 obs. of  13 variables:
 $ ride_id           : chr  "6C992BD37A98A63F" "1E0145613A209000" "E498E15508A80BAD" "1887262AD101C604" ...
 $ rideable_type     : chr  "classic_bike" "docked_bike" "docked_bike" "classic_bike" ...
 $ started_at        : chr  "2021-04-12 18:25:36" "2021-04-27 17:27:11" "2021-04-03 12:42:45" "2021-04-17 09:17:42" ...
 $ ended_at          : chr  "2021-04-12 18:56:55" "2021-04-27 18:31:29" "2021-04-07 11:40:24" "2021-04-17 09:42:48" ...
 $ start_station_name: chr  "State St & Pearson St" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Honore St & Division St" ...
 $ start_station_id  : chr  "TA1307000061" "KA1503000069" "20121" "TA1305000034" ...
 $ end_station_name  : chr  "Southport Ave & Waveland Ave" "Dorchester Ave & 49th St" "Loomis Blvd & 84th St" "Southport Ave & Waveland Ave" ...
 $ end_station_id    : chr  "13235" "KA1503000069" "20121" "13235" ...
 $ start_lat         : num  41.9 41.8 41.7 41.9 41.7 ...
 $ start_lng         : num 

In [7]:
str(dat2021May)

'data.frame':	531633 obs. of  13 variables:
 $ ride_id           : chr  "C809ED75D6160B2A" "DD59FDCE0ACACAF3" "0AB83CB88C43EFC2" "7881AC6D39110C60" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-05-30 11:58:15" "2021-05-30 11:29:14" "2021-05-30 14:24:01" "2021-05-30 14:25:51" ...
 $ ended_at          : chr  "2021-05-30 12:10:39" "2021-05-30 12:14:09" "2021-05-30 14:25:13" "2021-05-30 14:41:04" ...
 $ start_station_name: chr  "" "" "" "" ...
 $ start_station_id  : chr  "" "" "" "" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.6 -87.6 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.8 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.6 -87.6 -87.7 -87.7 -87.7 ...
 $ member_casual     : chr  "casual" "casual" "casual" "casual" ...


In [8]:
str(dat2021June)

'data.frame':	729595 obs. of  13 variables:
 $ ride_id           : chr  "99FEC93BA843FB20" "06048DCFC8520CAF" "9598066F68045DF2" "B03C0FE48C412214" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-06-13 14:31:28" "2021-06-04 11:18:02" "2021-06-04 09:49:35" "2021-06-03 19:56:05" ...
 $ ended_at          : chr  "2021-06-13 14:34:11" "2021-06-04 11:24:19" "2021-06-04 09:55:34" "2021-06-03 20:21:55" ...
 $ start_station_name: chr  "" "" "" "" ...
 $ start_station_id  : chr  "" "" "" "" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.8 41.8 41.8 41.8 41.8 ...
 $ start_lng         : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ end_lat           : num  41.8 41.8 41.8 41.8 41.8 ...
 $ end_lng           : num  -87.6 -87.6 -87.6 -87.6 -87.6 ...
 $ member_casual     : chr  "member" "member" "member" "member" ...


In [9]:
str(dat2021July)

'data.frame':	822410 obs. of  13 variables:
 $ ride_id           : chr  "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
 $ rideable_type     : chr  "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
 $ started_at        : chr  "2021-07-02 14:44:36" "2021-07-07 16:57:42" "2021-07-25 11:30:55" "2021-07-08 22:08:30" ...
 $ ended_at          : chr  "2021-07-02 15:19:58" "2021-07-07 17:16:09" "2021-07-25 11:48:45" "2021-07-08 22:23:32" ...
 $ start_station_name: chr  "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "13001" "17660" "SL-012" "17660" ...
 $ end_station_name  : chr  "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
 $ end_station_id    : chr  "KA1504000117" "13432" "KA1503000044" "13196" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.6 -87.

In [10]:
str(dat2021August)

'data.frame':	804352 obs. of  13 variables:
 $ ride_id           : chr  "99103BB87CC6C1BB" "EAFCCCFB0A3FC5A1" "9EF4F46C57AD234D" "5834D3208BFAF1DA" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-08-10 17:15:49" "2021-08-10 17:23:14" "2021-08-21 02:34:23" "2021-08-21 06:52:55" ...
 $ ended_at          : chr  "2021-08-10 17:22:44" "2021-08-10 17:39:24" "2021-08-21 02:50:36" "2021-08-21 07:08:13" ...
 $ start_station_name: chr  "" "" "" "" ...
 $ start_station_id  : chr  "" "" "" "" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.8 41.8 42 42 41.8 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.6 ...
 $ end_lat           : num  41.8 41.8 42 42 41.8 ...
 $ end_lng           : num  -87.7 -87.6 -87.7 -87.7 -87.6 ...
 $ member_casual     : chr  "member" "member" "member" "member" ...


In [11]:
str(dat2021September)

'data.frame':	756147 obs. of  13 variables:
 $ ride_id           : chr  "9DC7B962304CBFD8" "F930E2C6872D6B32" "6EF72137900BB910" "78D1DE133B3DBF55" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-09-28 16:07:10" "2021-09-28 14:24:51" "2021-09-28 00:20:16" "2021-09-28 14:51:17" ...
 $ ended_at          : chr  "2021-09-28 16:09:54" "2021-09-28 14:40:05" "2021-09-28 00:23:57" "2021-09-28 15:00:06" ...
 $ start_station_name: chr  "" "" "" "" ...
 $ start_station_id  : chr  "" "" "" "" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.8 41.8 41.9 ...
 $ start_lng         : num  -87.7 -87.6 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 42 41.8 41.8 41.9 ...
 $ end_lng           : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ member_casual     : chr  "casual" "casual" "casual" "casual" ...


In [13]:
str(dat2021October)

'data.frame':	631226 obs. of  13 variables:
 $ ride_id           : chr  "620BC6107255BF4C" "4471C70731AB2E45" "26CA69D43D15EE14" "362947F0437E1514" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-10-22 12:46:42" "2021-10-21 09:12:37" "2021-10-16 16:28:39" "2021-10-16 16:17:48" ...
 $ ended_at          : chr  "2021-10-22 12:49:50" "2021-10-21 09:14:14" "2021-10-16 16:36:26" "2021-10-16 16:19:03" ...
 $ start_station_name: chr  "Kingsbury St & Kinzie St" "" "" "" ...
 $ start_station_id  : chr  "KA1503000043" "" "" "" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.6 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.6 -87.7 -87.7 -87.7 -87.7 ...
 $ member_casual     : chr  "member" "member" "member" "member

In [14]:
str(dat2021November)

'data.frame':	359978 obs. of  13 variables:
 $ ride_id           : chr  "7C00A93E10556E47" "90854840DFD508BA" "0A7D10CDD144061C" "2F3BE33085BCFF02" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-11-27 13:27:38" "2021-11-27 13:38:25" "2021-11-26 22:03:34" "2021-11-27 09:56:49" ...
 $ ended_at          : chr  "2021-11-27 13:46:38" "2021-11-27 13:56:10" "2021-11-26 22:05:56" "2021-11-27 10:01:50" ...
 $ start_station_name: chr  "" "" "" "" ...
 $ start_station_id  : chr  "" "" "" "" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 42 42 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.8 -87.6 ...
 $ end_lat           : num  42 41.9 42 41.9 41.9 ...
 $ end_lng           : num  -87.7 -87.7 -87.7 -87.8 -87.6 ...
 $ member_casual     : chr  "casual" "casual" "casual" "casual" ...


In [15]:
str(dat2021December)

'data.frame':	247540 obs. of  13 variables:
 $ ride_id           : chr  "46F8167220E4431F" "73A77762838B32FD" "4CF42452054F59C5" "3278BA87BF698339" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "classic_bike" ...
 $ started_at        : chr  "2021-12-07 15:06:07" "2021-12-11 03:43:29" "2021-12-15 23:10:28" "2021-12-26 16:16:10" ...
 $ ended_at          : chr  "2021-12-07 15:13:42" "2021-12-11 04:10:23" "2021-12-15 23:23:14" "2021-12-26 16:30:53" ...
 $ start_station_name: chr  "Laflin St & Cullerton St" "LaSalle Dr & Huron St" "Halsted St & North Branch St" "Halsted St & North Branch St" ...
 $ start_station_id  : chr  "13307" "KP1705001026" "KA1504000117" "KA1504000117" ...
 $ end_station_name  : chr  "Morgan St & Polk St" "Clarendon Ave & Leland Ave" "Broadway & Barry Ave" "LaSalle Dr & Huron St" ...
 $ end_station_id    : chr  "TA1307000130" "TA1307000119" "13137" "KP1705001026" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_ln

## Pre-analysis on January- December data.
The twelve (12) independent datasets have the same column names and data types.

The next step is to combine the twelve (12) independent data frames into a single data frame.

In [16]:
datfill <- rbind(dat2021January, dat2021February, dat2021March, dat2021April,
                 dat2021May, dat2021June, dat2021July, dat2021August,
                 dat2021September, dat2021October, dat2021November, dat2021December)
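
A quick sanity check after binding is that no rows were lost: the combined row count should equal the sum of the monthly counts. The sketch below uses the observation counts reported by the `str()` outputs above.

```r
# Observation counts copied from the str() outputs above (January to December)
monthly_rows <- c(96834, 49622, 228496, 337230, 531633, 729595,
                  822410, 804352, 756147, 631226, 359978, 247540)

total_rows <- sum(monthly_rows)
total_rows

# In the notebook itself the same check would be:
# stopifnot(nrow(datfill) == total_rows)
```

The total is 5,595,063, which matches the row count that `glimpse(datfill)` and `str(datfill)` report for the combined data frame.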

## Getting to know your data in detail

In [17]:
colnames(datfill)

In [18]:
head(datfill)

| | ride_id | rideable_type | started_at | ended_at | start_station_name | start_station_id | end_station_name | end_station_id | start_lat | start_lng | end_lat | end_lng | member_casual |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 1 | E19E6F1B8D4C42ED | electric_bike | 2021-01-23 16:14:19 | 2021-01-23 16:24:44 | California Ave & Cortez St | 17660 | | | 41.90034 | -87.69674 | 41.89 | -87.72 | member |
| 2 | DC88F20C2C55F27F | electric_bike | 2021-01-27 18:43:08 | 2021-01-27 18:47:12 | California Ave & Cortez St | 17660 | | | 41.90033 | -87.69671 | 41.9 | -87.69 | member |
| 3 | EC45C94683FE3F27 | electric_bike | 2021-01-21 22:35:54 | 2021-01-21 22:37:14 | California Ave & Cortez St | 17660 | | | 41.90031 | -87.69664 | 41.9 | -87.7 | member |
| 4 | 4FA453A75AE377DB | electric_bike | 2021-01-07 13:31:13 | 2021-01-07 13:42:55 | California Ave & Cortez St | 17660 | | | 41.9004 | -87.69666 | 41.92 | -87.69 | member |
| 5 | BE5E8EB4E7263A0B | electric_bike | 2021-01-23 02:24:02 | 2021-01-23 02:24:45 | California Ave & Cortez St | 17660 | | | 41.90033 | -87.6967 | 41.9 | -87.7 | casual |
| 6 | 5D8969F88C773979 | electric_bike | 2021-01-09 14:24:07 | 2021-01-09 15:17:54 | California Ave & Cortez St | 17660 | | | 41.90041 | -87.69676 | 41.94 | -87.71 | casual |


In [19]:
glimpse(datfill)

Rows: 5,595,063
Columns: 13
$ ride_id            [3m[90m<chr>[39m[23m "E19E6F1B8D4C42ED", "DC88F20C2C55F27F", "EC45C94683…
$ rideable_type      [3m[90m<chr>[39m[23m "electric_bike", "electric_bike", "electric_bike", …
$ started_at         [3m[90m<chr>[39m[23m "2021-01-23 16:14:19", "2021-01-27 18:43:08", "2021…
$ ended_at           [3m[90m<chr>[39m[23m "2021-01-23 16:24:44", "2021-01-27 18:47:12", "2021…
$ start_station_name [3m[90m<chr>[39m[23m "California Ave & Cortez St", "California Ave & Cor…
$ start_station_id   [3m[90m<chr>[39m[23m "17660", "17660", "17660", "17660", "17660", "17660…
$ end_station_name   [3m[90m<chr>[39m[23m "", "", "", "", "", "", "", "", "", "Wood St & Augu…
$ end_station_id     [3m[90m<chr>[39m[23m "", "", "", "", "", "", "", "", "", "657", "13258",…
$ start_lat          [3m[90m<dbl>[39m[23m 41.90034, 41.90033, 41.90031, 41.90040, 41.90033, 4…
$ start_lng          [3m[90m<dbl>[39m[23m -87.69674, -87.69671, -87.69664, -8

In [20]:
str(datfill)

'data.frame':	5595063 obs. of  13 variables:
 $ ride_id           : chr  "E19E6F1B8D4C42ED" "DC88F20C2C55F27F" "EC45C94683FE3F27" "4FA453A75AE377DB" ...
 $ rideable_type     : chr  "electric_bike" "electric_bike" "electric_bike" "electric_bike" ...
 $ started_at        : chr  "2021-01-23 16:14:19" "2021-01-27 18:43:08" "2021-01-21 22:35:54" "2021-01-07 13:31:13" ...
 $ ended_at          : chr  "2021-01-23 16:24:44" "2021-01-27 18:47:12" "2021-01-21 22:37:14" "2021-01-07 13:42:55" ...
 $ start_station_name: chr  "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" "California Ave & Cortez St" ...
 $ start_station_id  : chr  "17660" "17660" "17660" "17660" ...
 $ end_station_name  : chr  "" "" "" "" ...
 $ end_station_id    : chr  "" "" "" "" ...
 $ start_lat         : num  41.9 41.9 41.9 41.9 41.9 ...
 $ start_lng         : num  -87.7 -87.7 -87.7 -87.7 -87.7 ...
 $ end_lat           : num  41.9 41.9 41.9 41.9 41.9 ...
 $ end_lng           : num  -87.7 -

In [21]:
summary(datfill)

   ride_id          rideable_type       started_at          ended_at        
 Length:5595063     Length:5595063     Length:5595063     Length:5595063    
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
 start_station_name start_station_id   end_station_name   end_station_id    
 Length:5595063     Length:5595063     Length:5595063     Length:5595063    
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            

In [22]:
skim_without_charts(datfill)

In [None]:
head(datfill)

In [None]:
sum(is.na(datfill))

# Cleaning your data

We need to check for duplicates and errors in the data that might lead to issues with bias or credibility.
We also need to drop NA values.
I will drop the NA values and convert the date column, which is stored as "chr" type. We will load the "lubridate" library to help with that.
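
The duplicate check can be sketched in base R as follows, assuming `ride_id` is meant to identify each ride uniquely (an assumption; the source does not state it). The data frame `rides` is a made-up example, not the real data.

```r
# Made-up example with one repeated ride_id
rides <- data.frame(
  ride_id = c("A1", "B2", "A1", "C3"),
  member_casual = c("member", "casual", "member", "casual"),
  stringsAsFactors = FALSE
)

# Count rows whose ride_id already appeared earlier in the data
n_duplicates <- sum(duplicated(rides$ride_id))

# Keep only the first occurrence of each ride_id
rides_unique <- rides[!duplicated(rides$ride_id), ]
```

On the combined data, the same two lines would use `datfill$ride_id` in place of `rides$ride_id`.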



Looking at the column names, we only need the columns that give insight into behavioural differences between casual riders and annual members in terms of bike utilisation, such as:

ride_id, started_at, start_station_id, end_station_id, end_station_name

We have to remove “rideable_type” as well, because the rideable type data is only available in “q1_2020” and we do not have a metadata set that maps rideable type to bike ID. We can simply select the required columns in the new data frame.

In [None]:
new_datfill <- drop_na(datfill)

In [None]:
sum(is.na(new_datfill))
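
One caveat is visible in the `str()` outputs above: missing station names are stored as empty strings (`""`), not as `NA`, so `is.na()` and `drop_na()` do not see them. The base R sketch below shows how such blanks could be converted to `NA` first. `blank_to_na` is a hypothetical helper and `rides` is a made-up example.

```r
# Hypothetical helper: turn empty strings in character columns into NA
blank_to_na <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(col) {
    col[!is.na(col) & col == ""] <- NA_character_
    col
  })
  df
}

# Made-up example: two blank station names and one true NA
rides <- data.frame(
  ride_id = c("A1", "B2", "C3"),
  end_station_name = c("", "Clark St & Lake St", ""),
  end_lat = c(41.9, NA, 41.8),
  stringsAsFactors = FALSE
)

na_before <- sum(is.na(rides))              # sees only the NA in end_lat
na_after <- sum(is.na(blank_to_na(rides)))  # also sees the blank station names
```

Whether rides with a blank station should then be dropped is a separate decision, since a ride without a recorded station may still be a valid trip.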

In [None]:
library(lubridate)

### Correct the date format

In [None]:
new_datfill$date <- as.Date(new_datfill$started_at)
new_datfill$month <- format(as.Date(new_datfill$date), "%m")

In [None]:
unique(new_datfill$month)

### Create new useful columns
I think it is useful to create three new columns: ride_length, to calculate how long clients ride; day_of_week, to see which weekdays have the most rides; and ride_distance, to calculate how far clients ride.

In [None]:
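library(geosphere)

# Ride duration in minutes (started_at and ended_at are "YYYY-MM-DD HH:MM:SS" strings)
new_datfill$ride_length <- difftime(new_datfill$ended_at, new_datfill$started_at, units = "mins")

# Name of the weekday on which the ride started
new_datfill$day_of_week <- format(as.Date(new_datfill$date), "%A")

# Geodesic distance between start and end coordinates; distGeo() returns metres
new_datfill$ride_distance <- distGeo(matrix(c(new_datfill$start_lng, new_datfill$start_lat), ncol = 2),
                                       matrix(c(new_datfill$end_lng, new_datfill$end_lat), ncol = 2))
# Convert metres to kilometres
new_datfill$ride_distance <- new_datfill$ride_distance/1000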
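
As a rough cross-check of the distance step, the great-circle (haversine) formula can be written in a few lines of base R. This is a swapped-in approximation on a sphere, not the ellipsoidal method that `distGeo()` uses, so the two results will differ slightly. The function name `haversine_km` and the mean Earth radius of 6371 km are assumptions of this sketch.

```r
# Haversine distance in kilometres between two points given in degrees.
# Assumes a spherical Earth with radius 6371 km (an approximation).
haversine_km <- function(lng1, lat1, lng2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  # pmin() guards against rounding pushing sqrt(a) just above 1
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Coordinates of the first January ride shown in the head(datfill) output
first_ride_km <- haversine_km(-87.69674, 41.90034, -87.72, 41.89)
first_ride_km
```

Because the function is vectorised, it can take whole coordinate columns at once, which makes it easy to compare against `ride_distance` on a sample of rows.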


In [None]:
head(new_datfill)

### Saving the result as a CSV

In [None]:
new_datfill %>%
  write.csv("Bicyle_Cyclist_clean_data.csv")

### Guiding questions
What tools are you choosing and why?
I'm using R for this project for two main reasons: it can handle the large dataset, and I want to gain experience with the language.

### Have you ensured your data’s integrity?
Yes, the data is consistent throughout the columns.

### What steps have you taken to ensure that your data is clean?
First the rows with NA values were dropped, then the date column was converted to its correct format.

### How can you verify that your data is clean and ready to analyze?
It can be verified by this notebook.

### Have you documented your cleaning process so you can review and share those results?
Yes, it's all documented in this R notebook.

### Key tasks
###### [x] Check the data for errors.
###### [x] Choose your tools.
###### [x] Transform the data so you can work with it effectively.
###### [x] Document the cleaning process.

### Deliverable
##### [x] Documentation of any cleaning or manipulation of data

# Analyze
The data exploration will consist of building a profile for annual members and how they differ from casual riders.

Putting in a new variable with a simpler name will help reduce some typing in the future.

In [None]:
data_group <- new_datfill %>% 
  group_by(member_casual) %>% 
  dplyr::summarise(avg_time = mean(ride_length), avg_distance = mean(ride_distance)) %>% 
  as.data.frame()
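
Because a mean is sensitive to a few very long rides, it can help to look at the median alongside it. The base R sketch below uses made-up ride lengths, not the real data, to show how far the two can diverge.

```r
# Made-up ride lengths in minutes; one casual ride is an extreme outlier
rides <- data.frame(
  member_casual = c("casual", "casual", "casual", "member", "member", "member"),
  ride_length = c(12, 18, 600, 9, 11, 13)
)

# Mean and median ride length per user type
mean_length <- tapply(rides$ride_length, rides$member_casual, mean)
median_length <- tapply(rides$ride_length, rides$member_casual, median)

rbind(mean = mean_length, median = median_length)
```

In this toy example the single 600-minute ride pulls the casual mean far above the casual median, while the member values barely differ.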

In [None]:
library(ggplot2)

In [None]:
ggplot(data = data_group) + 
  geom_col(mapping=aes(x=member_casual,y=avg_distance,fill=member_casual), show.legend = TRUE)+
  labs(title = "Avg travel distance by User type",x="User Type",y="Mean distance In Km")

## Analysis
Over the year, the mean distance (in km) covered by casual riders was greater than the mean distance covered by members.

In [None]:
ggplot(data = data_group) + 
  geom_col(mapping=aes(x = member_casual, y = avg_time, fill=member_casual), show.legend = TRUE)+
  labs(title = "Avg travel time by User type",x="User Type",y="Avg time in Mins")

## Analysis
On average, casual riders spent more time per ride than members throughout the year.

In [None]:
new_datfill %>% 
  mutate(month = month(started_at, label = TRUE)) %>% 
  group_by(member_casual, month) %>% 
  dplyr::summarise(number_of_rides = n(),
            average_duration = mean(ride_length), .groups = 'drop' ) %>% 
  arrange(member_casual, month) %>% 
  ggplot(aes(x = month, y = number_of_rides, fill = member_casual))+
  geom_col(position = "dodge")+
  labs(title = "Number of rides by User type during the month",x="January to December",y="Number of rides", fill="User type") +
  theme(legend.position="top")

# Analysis
Casual ride counts follow a roughly bell-shaped curve over the year.

Casual rides begin to increase in February and peak in July. A likely explanation is the summer break in the United States, which lasts approximately two and a half months: students typically finish the school year between late May and mid-June and start the new year between late August and early September. Thus we have more riders during the holiday.

In [None]:
new_datfill %>% 
  mutate( weekday = wday(started_at, label = TRUE)) %>% 
  group_by(member_casual, weekday) %>% 
  dplyr::summarise(number_of_rides = n(), average_duration = mean(ride_length), .groups = "drop") %>% 
  arrange(member_casual, weekday) %>% 
  ggplot()+
  geom_col(mapping = aes(x = weekday, y = number_of_rides, fill = member_casual), position = "dodge")+
  labs(title = "Number of rides by user type during a week", x = "Weekdays", y = "Number of rides", fill = "User type")+
  theme(legend.position = "top")

In [None]:
head(new_datfill)

In [None]:
new_datfill %>% 
  group_by(member_casual,day_of_week) %>%  
  # total_distance_km sums ride_distance (km) for each user type and day
  dplyr::summarise(number_of_rides = n(), total_distance_km = sum(ride_distance), .groups = "drop") %>% 
  arrange(member_casual, desc(number_of_rides))

## Analysis
Casual rides peak on Saturday (557,187 rides), followed by Sunday (480,395 rides).

The fewest casual rides occur on Tuesday.

Member rides peak on Wednesday and are lowest on Sunday.
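The busiest and quietest days can also be pulled out programmatically instead of being read off the sorted table. The sketch below assumes a summary table with the columns `member_casual`, `day_of_week`, and `number_of_rides`; `toy_counts` is a hypothetical stand-in with invented counts, and `slice_max()` / `slice_min()` require dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical ride counts per user type and day (all values are invented)
toy_counts <- data.frame(
  member_casual   = rep(c("casual", "member"), each = 7),
  day_of_week     = rep(c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"), times = 2),
  number_of_rides = c(48, 28, 26, 27, 30, 36, 55,
                      39, 47, 52, 53, 51, 46, 44),
  stringsAsFactors = FALSE
)

# Day with the most rides for each user type
busiest <- toy_counts %>%
  group_by(member_casual) %>%
  slice_max(number_of_rides, n = 1) %>%
  ungroup()

# Day with the fewest rides for each user type
quietest <- toy_counts %>%
  group_by(member_casual) %>%
  slice_min(number_of_rides, n = 1) %>%
  ungroup()

print(busiest)
print(quietest)
```

With the real data, the same two pipelines could be applied to the grouped summary computed from `new_datfill` in the cell above.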

In [None]:
new_datfill %>%  
  group_by(member_casual, day_of_week) %>% 
  dplyr::summarise(number_of_rides = n(), .groups = "drop") %>% 
  arrange(member_casual, day_of_week)  %>% 
  ggplot(aes(x = day_of_week, y = number_of_rides, fill = member_casual)) +
  labs(title ="Total rides by customer type Vs. Day of the week") +
  geom_col(width=0.5, position = position_dodge(width=0.5)) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE))

### Share 
After computing the average ride time and average ride distance, I found that casual riders take more rides than members during the summer, which suggests that casual users prefer to ride in the summer.

Looking at the weekly pattern, casual riders ride more at the weekend, on Saturday and Sunday. This suggests that many casual users may be students or working people who ride in their free time.

Cyclistic has three types of bikes, so I also analyzed which type users prefer. Both user types ride classic bikes the most. Since the COVID pandemic, people may be more conscious of their health, which might be why most users choose classic bikes: riding under their own power gives more exercise than relying on electric assistance.

## Analysis

Casual riders generally take fewer rides than members during the working week.

Note: member ride counts are consistent across weekdays, with less spread.

In [None]:
head(new_datfill)

## Analysis
There are more casual riders on Saturday and Sunday, likely because more people are off work.

In [None]:
preferred_bike_b <- new_datfill %>% filter(rideable_type=="classic_bike" | rideable_type=="electric_bike")
  
preferred_bike_b %>%  group_by(member_casual, rideable_type) %>% 
  dplyr::summarise(totals = n(), .groups = "drop") %>%
  ggplot()+
  geom_col(mapping = aes(x = member_casual, y = totals, fill = rideable_type), position = "dodge")+
  labs(title = "Preferred Bike Type by Users", x="User type",y= "Number of rides", fill="Bike type") +
  theme_minimal() +
  theme(legend.position="top")

## Analysis
Casual riders use classic bikes more than electric bikes.

Members also use classic bikes more than electric bikes.

# Deliverable

### Conclusions

Measured by ride duration in minutes, casual riders ride for almost twice as long as members.

Casual customers use bikeshare services more during weekends, while members use them consistently over the entire week.

Casual riders prefer docked bikes, while classic bikes and electric bikes are used by members and casuals almost equally.

The average ride duration of casual riders is more than twice that of members on any given day of the week.
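The duration comparisons in these conclusions can be checked by dividing the casual average by the member average. The sketch below is a minimal illustration under assumptions: `toy_group` is a hypothetical table with invented values that mirrors the layout of `data_group`. If `ride_length` is stored as a `difftime` in the notebook, `avg_time` may need wrapping in `as.numeric()` first.

```r
# Hypothetical per-group averages (values are invented); in the notebook
# the real numbers would come from data_group$avg_time
toy_group <- data.frame(
  member_casual = c("casual", "member"),
  avg_time      = c(24, 12),  # minutes
  stringsAsFactors = FALSE
)

casual_avg <- toy_group$avg_time[toy_group$member_casual == "casual"]
member_avg <- toy_group$avg_time[toy_group$member_casual == "member"]

# Ratio of casual to member average ride duration
duration_ratio <- casual_avg / member_avg
print(duration_ratio)
```

A ratio just under 2 would support "almost twice", and a ratio above 2 would support "more than twice". Grouping by day of the week before taking the ratio would test the per-day version of the claim.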


### Recommendations

Provide promotions for casual riders for membership subscription.

Offer discounted pricing during the week so that casual riders might choose to use bikes during the week.

Offer members free promotional rides on one working day, to attract casual riders to subscribe for membership.

Provide attractive membership promotions for casual riders, so that weekend riders who convert to members contribute more to profit.



