# Cyclistic Bike Share Analysis

In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that
are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and
returned to any other station in the system anytime.

Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments.
One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes,
and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers
who purchase annual memberships are Cyclistic members.

**Here is the set goal**: To design marketing strategies aimed at converting casual riders into annual members. To do that,
however, the marketing analyst team needs to better understand how annual members and casual riders differ and why
casual riders would buy a membership. We are interested in analyzing Cyclistic's historical bike trip data to identify trends.

### Deliverables
1. A clear statement of the business task
2. A description of all data sources used
3. Documentation of any cleaning or manipulation of data
4. Summary of the analysis
5. Supporting visualizations and key findings
6. Top three recommendations based on the analysis

**Business Task**

How do annual members and casual riders use Cyclistic bikes differently?

****

## Data Sources

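The trip data comes from the public Divvy trip data archive: https://divvy-tripdata.s3.amazonaws.com/index.html

This analysis uses the May 2023 file, `202305-divvy-tripdata.csv`.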

-------

## Data Preparation

In [1]:
# importing necessary modules
import pandas as pd
import numpy as np
import seaborn as sns

In [2]:
data = pd.read_csv(r"C:\Users\WELLS\202305-divvy-tripdata.csv")

In [3]:
# previewing the data
data.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
0,0D9FA920C3062031,electric_bike,2023-05-07 19:53:48,2023-05-07 19:58:32,Southport Ave & Belmont Ave,13229,,,41.939408,-87.663831,41.93,-87.65,member
1,92485E5FB5888ACD,electric_bike,2023-05-06 18:54:08,2023-05-06 19:03:35,Southport Ave & Belmont Ave,13229,,,41.939482,-87.663848,41.94,-87.69,member
2,FB144B3FC8300187,electric_bike,2023-05-21 00:40:21,2023-05-21 00:44:36,Halsted St & 21st St,13162,,,41.853793,-87.646719,41.86,-87.65,member
3,DDEB93BC2CE9AA77,classic_bike,2023-05-10 16:47:01,2023-05-10 16:59:52,Carpenter St & Huron St,13196,Damen Ave & Cortland St,13133.0,41.894556,-87.653449,41.915983,-87.677335,member
4,C07B70172FC92F59,classic_bike,2023-05-09 18:30:34,2023-05-09 18:39:28,Southport Ave & Clark St,TA1308000047,Southport Ave & Belmont Ave,13229.0,41.957081,-87.664199,41.939478,-87.663748,member


In [4]:
data.shape

(604827, 13)

We have a total of 604,827 rows and 13 columns before cleaning.

In [5]:
data.isnull().sum()

ride_id                   0
rideable_type             0
started_at                0
ended_at                  0
start_station_name    89240
start_station_id      89240
end_station_name      95267
end_station_id        95267
start_lat                 0
start_lng                 0
end_lat                 710
end_lng                 710
member_casual             0
dtype: int64

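For this business task, we will not need the station columns (start_station_name, start_station_id, end_station_name, and end_station_id) or the coordinate columns (start_lat, start_lng, end_lat, and end_lng), because location details do not help answer the business question (**How do annual members and casual riders use Cyclistic bikes differently?**). Dropping them also removes every column that contains missing values.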

**Data cleaning**

In [6]:
# dropping the station columns (start_station_name, start_station_id, end_station_name, end_station_id)
# and the coordinate columns (start_lat, start_lng, end_lat, end_lng)

data_dropped = data.drop(['start_station_name', 'start_station_id', 'end_station_name', 'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng'], axis = 1)

In [7]:
data_dropped.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,member_casual
0,0D9FA920C3062031,electric_bike,2023-05-07 19:53:48,2023-05-07 19:58:32,member
1,92485E5FB5888ACD,electric_bike,2023-05-06 18:54:08,2023-05-06 19:03:35,member
2,FB144B3FC8300187,electric_bike,2023-05-21 00:40:21,2023-05-21 00:44:36,member
3,DDEB93BC2CE9AA77,classic_bike,2023-05-10 16:47:01,2023-05-10 16:59:52,member
4,C07B70172FC92F59,classic_bike,2023-05-09 18:30:34,2023-05-09 18:39:28,member


In [8]:
data_dropped.isnull().sum()

ride_id          0
rideable_type    0
started_at       0
ended_at         0
member_casual    0
dtype: int64

In [9]:
#getting the data types of the columns and the formatting

data_dropped.dtypes

ride_id          object
rideable_type    object
started_at       object
ended_at         object
member_casual    object
dtype: object

In [10]:
data_dropped['started_at'] = pd.to_datetime(data_dropped['started_at'])

In [11]:
data_dropped['ended_at'] = pd.to_datetime(data_dropped['ended_at'])

In [12]:
data_dropped.dtypes

ride_id                  object
rideable_type            object
started_at       datetime64[ns]
ended_at         datetime64[ns]
member_casual            object
dtype: object
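
The column selection and date conversion above could also be done while loading the file. The following is a minimal sketch of that alternative, not the method used in this notebook. It reads a small in-memory sample (two rows copied from the preview, trimmed to a few columns) so that it runs without the original CSV. The names `sample_csv`, `keep`, and `trips` are made up for the example.

```python
import io
import pandas as pd

# Two rows copied from the preview above, trimmed to a few columns so the
# example runs without the original file.
sample_csv = io.StringIO(
    "ride_id,rideable_type,started_at,ended_at,start_lat,member_casual\n"
    "0D9FA920C3062031,electric_bike,2023-05-07 19:53:48,2023-05-07 19:58:32,41.939408,member\n"
    "DDEB93BC2CE9AA77,classic_bike,2023-05-10 16:47:01,2023-05-10 16:59:52,41.894556,member\n"
)

keep = ["ride_id", "rideable_type", "started_at", "ended_at", "member_casual"]

# usecols skips the unwanted columns; parse_dates converts the timestamps on load.
trips = pd.read_csv(sample_csv, usecols=keep, parse_dates=["started_at", "ended_at"])
print(trips.dtypes)
```

With the real file, the same two arguments would be passed to the `pd.read_csv` call in cell 2.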

**Feature Engineering**

In [13]:
# creating a column called "ride_length" to determine the duration of each ride and whether it differs by membership plan

data_dropped['ride_length'] = data_dropped['ended_at'] - data_dropped['started_at']

In [14]:
# creating a column with the name of the weekday on which each ride started
data_dropped['day_of_week'] = data_dropped['started_at'].dt.day_name()

In [15]:
# creating a column that calculates the total ride length in seconds

data_dropped['ride_length_in_sec'] = data_dropped['ride_length'].dt.total_seconds()

In [16]:
data_dropped.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,member_casual,ride_length,day_of_week,ride_length_in_sec
0,0D9FA920C3062031,electric_bike,2023-05-07 19:53:48,2023-05-07 19:58:32,member,0 days 00:04:44,Sunday,284.0
1,92485E5FB5888ACD,electric_bike,2023-05-06 18:54:08,2023-05-06 19:03:35,member,0 days 00:09:27,Saturday,567.0
2,FB144B3FC8300187,electric_bike,2023-05-21 00:40:21,2023-05-21 00:44:36,member,0 days 00:04:15,Sunday,255.0
3,DDEB93BC2CE9AA77,classic_bike,2023-05-10 16:47:01,2023-05-10 16:59:52,member,0 days 00:12:51,Wednesday,771.0
4,C07B70172FC92F59,classic_bike,2023-05-09 18:30:34,2023-05-09 18:39:28,member,0 days 00:08:54,Tuesday,534.0
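
One extra check that may be worth running on the new column is whether any ride has a zero or negative duration, which can happen when the end timestamp is not later than the start timestamp. The sketch below uses two made-up rows (the second is deliberately invalid) to show the idea. Whether the May 2023 file contains such rows is not shown in this notebook, and the names `demo`, `invalid`, and `valid_rides` are illustrative.

```python
import pandas as pd

# Made-up rows for illustration only; the second one ends before it starts.
demo = pd.DataFrame({
    "started_at": pd.to_datetime(["2023-05-07 19:53:48", "2023-05-07 10:00:00"]),
    "ended_at": pd.to_datetime(["2023-05-07 19:58:32", "2023-05-07 09:59:00"]),
})
demo["ride_length_in_sec"] = (demo["ended_at"] - demo["started_at"]).dt.total_seconds()

# Flag rides with zero or negative duration, then keep only the valid ones.
invalid = demo["ride_length_in_sec"] <= 0
print(f"{invalid.sum()} of {len(demo)} rides have a non-positive duration")
valid_rides = demo.loc[~invalid]
```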


The data is cleaned and ready for analysis.

## Exploratory Data Analysis

How many distinct values do rideable_type and member_casual have?

In [17]:
data_dropped['rideable_type'].unique()

array(['electric_bike', 'classic_bike', 'docked_bike'], dtype=object)

In [18]:
data_dropped['member_casual'].unique()

array(['member', 'casual'], dtype=object)

**Is there any relationship between the type of bike and the membership plan?**

That is, we need to determine whether casual riders favor a specific bike type and whether there is a bike type that annual members prefer.

This helps us understand how annual members and casual riders use Cyclistic bikes differently.

In [19]:
data_group_type = data_dropped.groupby(['member_casual', 'rideable_type'])[['rideable_type']].count()

In [20]:
data_group_type

member_casual,rideable_type,rideable_type
casual,classic_bike,92598
casual,docked_bike,13092
casual,electric_bike,128491
member,classic_bike,177297
member,electric_bike,193349


The table above shows that both groups (annual members and casual riders) use the **electric bike type** more than any other. The main difference is that members recorded no rides on **docked_bike at all**.
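
To make the comparison easier to read, the counts can be turned into the share of each bike type within each rider group. The sketch below rebuilds the grouped table from the figures reported above. The zero for member rides on docked bikes is inferred from the missing row, and the names `counts` and `shares` are illustrative.

```python
import pandas as pd

# Counts copied from the grouped table above; the member/docked_bike value
# of 0 is inferred from the absence of that row in the output.
counts = pd.DataFrame(
    {
        "classic_bike": [92598, 177297],
        "docked_bike": [13092, 0],
        "electric_bike": [128491, 193349],
    },
    index=pd.Index(["casual", "member"], name="member_casual"),
)

# Divide each row by its total to get the percentage share per rider group.
shares = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(shares)
```

On these figures, electric bikes account for roughly 55% of casual rides and 52% of member rides. On the full data, `pd.crosstab(data_dropped['member_casual'], data_dropped['rideable_type'], normalize='index')` should give the same shares as fractions.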

---------

This prompts the question: does the membership plan include **docked bikes**?

If YES, then members never use docked bikes, which seems unlikely.

But

if NO, then we can **recommend that docked bikes be added to the membership plan in order to draw customers from the casual plan to the membership plan**.

--------

**How does ride length relate to the membership plan?**

First, we need to get the average ride length for members and casual riders.

In [21]:
#getting the average ride lengths in seconds
data_group_length = data_dropped.groupby(['member_casual'])[['ride_length_in_sec']].mean().round().reset_index()

In [22]:
data_group_length

Unnamed: 0,member_casual,ride_length_in_sec
0,casual,1711.0
1,member,782.0
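
Seconds are hard to read at a glance, so it can help to express the two averages in minutes and as a ratio. The sketch below starts from the rounded means reported above rather than from the full data. The names `mean_length` and `ratio` are illustrative.

```python
import pandas as pd

# Mean ride lengths in seconds, copied from the output above.
mean_length = pd.DataFrame(
    {"member_casual": ["casual", "member"], "ride_length_in_sec": [1711.0, 782.0]}
)

# Convert to minutes for readability and compare the two groups.
mean_length["ride_length_in_min"] = (mean_length["ride_length_in_sec"] / 60).round(1)
ratio = mean_length.loc[0, "ride_length_in_sec"] / mean_length.loc[1, "ride_length_in_sec"]
print(mean_length)
print(f"casual / member ratio: {ratio:.2f}")
```

That is about 28.5 minutes for casual riders against about 13 minutes for members. Because a mean can be pulled up by a few very long rides, replacing `.mean()` with `.median()` in cell 21 may be a useful cross-check.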


In [23]:
#getting the average ride lengths by day_of_week
data_group_day = data_dropped.groupby(['day_of_week'])[['ride_length_in_sec']].mean().round().reset_index()

In [24]:
data_group_day.sort_values('ride_length_in_sec', ascending=False)

Unnamed: 0,day_of_week,ride_length_in_sec
3,Sunday,1448.0
2,Saturday,1376.0
0,Friday,1103.0
1,Monday,1096.0
5,Tuesday,1042.0
4,Thursday,1027.0
6,Wednesday,950.0


In [25]:
#getting the average ride lengths by day_of_week and by membership plans
data_group_day_plan = data_dropped.groupby(['member_casual', 'day_of_week'])[['ride_length_in_sec']].mean().round()

In [26]:
data_group_day_plan.sort_values('ride_length_in_sec', ascending=False)

member_casual,day_of_week,ride_length_in_sec
casual,Sunday,1995.0
casual,Saturday,1913.0
casual,Monday,1669.0
casual,Friday,1668.0
casual,Thursday,1586.0
casual,Tuesday,1582.0
casual,Wednesday,1396.0
member,Sunday,890.0
member,Saturday,872.0
member,Tuesday,776.0


In [27]:
#getting the total number of rides by day_of_week
data_group_total = data_dropped.groupby('day_of_week')['rideable_type'].count().reset_index()

In [28]:
data_group_total = data_group_total.rename(columns={'rideable_type': 'count_of_rides'})

In [29]:
data_group_total.sort_values('count_of_rides', ascending=False)

Unnamed: 0,day_of_week,count_of_rides
5,Tuesday,101062
6,Wednesday,98910
2,Saturday,84080
4,Thursday,83957
3,Sunday,83367
1,Monday,78437
0,Friday,75014
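
Grouping by `day_of_week` orders the days alphabetically, which makes weekday patterns harder to see. One possible way to get calendar order is an ordered categorical, sketched below on the totals reported above. The names `totals`, `week_order`, and `by_weekday` are illustrative, and starting the week on Monday is an arbitrary choice.

```python
import pandas as pd

# Ride totals copied from the output above.
totals = pd.DataFrame({
    "day_of_week": ["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"],
    "count_of_rides": [75014, 78437, 84080, 83367, 83957, 101062, 98910],
})

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# An ordered categorical makes sort_values follow the calendar, not the alphabet.
totals["day_of_week"] = pd.Categorical(totals["day_of_week"], categories=week_order, ordered=True)
by_weekday = totals.sort_values("day_of_week").reset_index(drop=True)
print(by_weekday)
```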


In [30]:
#getting the total number of rides by day_of_week and by membership plans
data_group_plan_total = data_dropped.groupby(['member_casual', 'day_of_week'])[['rideable_type']].count()

In [31]:
data_group_plan_total

member_casual,day_of_week,rideable_type
casual,Friday,28712
casual,Monday,29541
casual,Saturday,40758
casual,Sunday,42124
casual,Thursday,28271
casual,Tuesday,33382
casual,Wednesday,31393
member,Friday,46302
member,Monday,48896
member,Saturday,43322


The analysis above can be summarized as follows:
* **Casual** riders have a higher average ride_length than **members** (about 1,711 seconds versus 782 seconds), which means casual riders take longer rides than members.
* Rides on Sundays are longer on average than rides on any other day.
* **Members** ride most on **Tuesdays**, while **casual** riders ride most on **Sundays**.
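
The per-plan count table above shows only part of the member rows, so the busiest day for members cannot be read from it directly. Since there are only two rider types, the member counts can be recovered by subtracting the casual counts from the daily totals. The sketch below does this with the figures reported above. The names `total_rides`, `casual_rides`, and `member_rides` are illustrative.

```python
import pandas as pd

days = ["Friday", "Monday", "Saturday", "Sunday", "Thursday", "Tuesday", "Wednesday"]

# Figures copied from the two count tables above.
total_rides = pd.Series([75014, 78437, 84080, 83367, 83957, 101062, 98910], index=days)
casual_rides = pd.Series([28712, 29541, 40758, 42124, 28271, 33382, 31393], index=days)

# Only two rider types exist, so member rides are the remainder.
member_rides = total_rides - casual_rides

print("Busiest day for members:", member_rides.idxmax(), member_rides.max())
print("Busiest day for casual riders:", casual_rides.idxmax(), casual_rides.max())
```

The derived Friday, Monday, and Saturday values match the member rows that are visible in the table, which suggests the subtraction is consistent with the original output.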

----

We export the cleaned data for visualization so that the findings can be communicated easily.

In [32]:
data_dropped.to_csv('cyclistic.csv', index=False)
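
CSV stores every value as text, so the datetime and duration columns lose their types when the exported file is read back. The sketch below shows one way to restore them. It writes a single row (taken from the preview above) to an in-memory buffer instead of `cyclistic.csv` so that it is self-contained. The names `row`, `buffer`, and `restored` are illustrative.

```python
import io
import pandas as pd

# One row shaped like the cleaned data, written to an in-memory buffer
# instead of a file so the example is self-contained.
row = pd.DataFrame({
    "started_at": pd.to_datetime(["2023-05-07 19:53:48"]),
    "ended_at": pd.to_datetime(["2023-05-07 19:58:32"]),
})
row["ride_length"] = row["ended_at"] - row["started_at"]

buffer = io.StringIO()
row.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the types on load: parse the timestamps, then the duration strings.
restored = pd.read_csv(buffer, parse_dates=["started_at", "ended_at"])
restored["ride_length"] = pd.to_timedelta(restored["ride_length"])
print(restored.dtypes)
```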