# Capstone Project

# Case Background
As a junior data analyst working in the marketing analyst team at Cyclistic (a bike-sharing company active in Chicago), I am tasked with understanding how casual riders and annual members use Cyclistic bikes differently. Casual riders consist of customers that purchase single-ride or full-day passes, whereas annual members subscribe yearly for unlimited biking access. The marketing director theorizes that the company's future success depends on maximizing the number of yearly memberships by converting casual riders into annual members. Pending executive approval, my team will be designing a new marketing strategy that pursues this idea.
To inform any decision-making behind Cyclistic's new marketing strategy, the goal of this project will be to uncover and convey actionable insights.


# Identify the business task

•	Product: a bike-sharing service with geotracked, network-locked bikes across Chicago

•	Customer types and revenue model: members (annual subscribers) and casual riders (single-ride and full-day purchasers)

•	Competitive advantages: bicycle variety and pricing flexibility

•	Social media: digital media can influence casual riders to become members


# Key stakeholders

•	Cyclistic executive team: compelling, relevant, and straightforward insights to inform data-driven marketing decisions

•	Lily Moreno, Director of Marketing: evidence to back up her theory and marketing recommendations

•	Marketing analytics team: uncovering the differences and motivations behind different customer types


# Data sources
The data we'll be using was extracted from here, together with a helper dataset downloaded from here that will be used to filter out any dirty data in our primary dataset. This data is made available by Motivate International Inc. under this license.
Note that Cyclistic is a fictional entity and Divvy's open data is used for the purpose of this case study.
The data has already been limited to the period between July 2020 and June 2021. Any filtering applied will exclude significant outliers and records with errors.
We assume that the data collection process was accomplished with integrity. Furthermore, we move forward with this analysis under the assumption that the data is free of any glaring inaccuracies, bias, and credibility issues. We also assume that the original repository has never been accessed or modified in an unauthorized manner.


# Setup: load libraries and functions

In [49]:
import pandas as pd              # data manipulation
import numpy as np               # efficient data types
import matplotlib.pyplot as plt  # plotting visuals
%matplotlib inline
import os                        # data file path handling
from glob import glob            # list all files that match a pattern

# Raw string so the backslash in the Windows path is not read as an escape
data = sorted(glob(r'F:\csvfile/divvy-tripdata_*.csv'))

# Read each matching CSV file into its own DataFrame
dfs = []
for filename in data:
    dfs.append(pd.read_csv(filename))

# Concatenate all data into one DataFrame
big_frame = pd.concat(dfs, ignore_index=True)
big_frame

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,day_of_week
0,762198876D69004D,docked_bike,7/9/2020 15:22,7/9/2020 15:25,Ritchie Ct & Banks St,180.0,Wells St & Evergreen Ave,291.0,41.906866,-87.626217,41.906724,-87.634830,member,0.002083,5
1,BEC9C9FBA0D4CF1B,docked_bike,7/24/2020 23:56,7/25/2020 0:20,Halsted St & Roscoe St,299.0,Broadway & Ridge Ave,461.0,41.943670,-87.648950,41.984045,-87.660274,member,0.016667,6
2,D2FD8EA432C77EC1,docked_bike,7/8/2020 19:49,7/8/2020 19:56,Lake Shore Dr & Diversey Pkwy,329.0,Clark St & Wellington Ave,156.0,41.932588,-87.636427,41.936497,-87.647539,casual,0.004861,4
3,54AE594E20B35881,docked_bike,7/17/2020 19:06,7/17/2020 19:27,LaSalle St & Illinois St,181.0,Clark St & Armitage Ave,94.0,41.890762,-87.631697,41.918306,-87.636282,casual,0.014583,6
4,54025FDC7440B56F,docked_bike,7/4/2020 10:39,7/4/2020 10:45,Lake Shore Dr & North Blvd,268.0,Clark St & Schiller St,301.0,41.911722,-87.626804,41.907993,-87.631501,member,0.004167,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4460146,9397BDD14798A1BA,docked_bike,3/20/2021 14:58,3/20/2021 17:22,Michigan Ave & Oak St,13042,New St & Illinois St,TA1306000013,41.900960,-87.623777,41.890847,-87.618617,casual,0.099896,7
4460147,BBBEB8D51AAD40DA,classic_bike,3/2/2021 11:35,3/2/2021 11:43,Kingsbury St & Kinzie St,KA1503000043,New St & Illinois St,TA1306000013,41.889177,-87.638506,41.890847,-87.618617,member,0.005868,3
4460148,637FF754DA0BD9E1,classic_bike,3/9/2021 11:07,3/9/2021 11:49,Michigan Ave & Oak St,13042,Clark St & Berwyn Ave,KA1504000146,41.900960,-87.623777,41.977997,-87.668047,member,0.028877,3
4460149,F8F43A0B978A7A35,classic_bike,3/1/2021 18:11,3/1/2021 18:18,Kingsbury St & Kinzie St,KA1503000043,New St & Illinois St,TA1306000013,41.889177,-87.638506,41.890847,-87.618617,member,0.004630,2


In [19]:
# get information from our dataframe (number of records, memory use and data types)

big_frame.info(memory_usage = 'deep')  

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4460151 entries, 0 to 4460150
Data columns (total 15 columns):
 #   Column              Dtype  
---  ------              -----  
 0   ride_id             object 
 1   rideable_type       object 
 2   started_at          object 
 3   ended_at            object 
 4   start_station_name  object 
 5   start_station_id    object 
 6   end_station_name    object 
 7   end_station_id      object 
 8   start_lat           float64
 9   start_lng           float64
 10  end_lat             float64
 11  end_lng             float64
 12  member_casual       object 
 13  ride_length         float64
 14  day_of_week         int64  
dtypes: float64(5), int64(1), object(9)
memory usage: 2.7 GB
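
Most of the 2.7 GB reported above comes from the nine `object` columns. A minimal sketch of one way to shrink it, assuming that `rideable_type` and `member_casual` hold only a handful of distinct values and that the timestamps follow the month/day/year pattern seen in the sample rows. The three-row inline CSV and the name `small` are hypothetical stand-ins for the real files:

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for one divvy-tripdata CSV file
sample_csv = io.StringIO(
    "ride_id,rideable_type,started_at,ended_at,member_casual\n"
    "A1,docked_bike,7/9/2020 15:22,7/9/2020 15:25,member\n"
    "B2,docked_bike,7/24/2020 23:56,7/25/2020 0:20,member\n"
    "C3,classic_bike,3/2/2021 11:35,3/2/2021 11:43,casual\n"
)

# Store repetitive text columns as categories instead of Python strings
small = pd.read_csv(
    sample_csv,
    dtype={"rideable_type": "category", "member_casual": "category"},
)

# Parse the timestamp strings into datetime64 values
for col in ["started_at", "ended_at"]:
    small[col] = pd.to_datetime(small[col], format="%m/%d/%Y %H:%M")

small.info(memory_usage="deep")
```

On the full dataset the same `dtype` mapping could be passed to each `pd.read_csv` call in the loading loop; how much memory it saves would need to be checked with `info()` again.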


In [20]:
# get descriptive statistics under each numeric column

big_frame.describe().apply(lambda s: s.apply('{0:.3f}'.format))   

Unnamed: 0,start_lat,start_lng,end_lat,end_lng,ride_length,day_of_week
count,4460151.0,4460151.0,4454865.0,4454865.0,4460151.0,4460151.0
mean,41.903,-87.644,41.903,-87.645,0.017,4.136
std,0.044,0.026,0.044,0.026,0.248,2.087
min,41.64,-87.87,41.51,-88.07,-20.174,1.0
25%,41.882,-87.659,41.882,-87.659,0.005,2.0
50%,41.899,-87.641,41.9,-87.641,0.009,4.0
75%,41.93,-87.627,41.93,-87.628,0.017,6.0
max,42.08,-87.52,42.16,-87.44,38.85,7.0


In [21]:
big_frame.columns    # names of all the columns

Index(['ride_id', 'rideable_type', 'started_at', 'ended_at',
       'start_station_name', 'start_station_id', 'end_station_name',
       'end_station_id', 'start_lat', 'start_lng', 'end_lat', 'end_lng',
       'member_casual', 'ride_length', 'day_of_week'],
      dtype='object')

# Cleaning data

In [22]:
big_frame.isnull().sum()     # count missing values in each column

ride_id                    0
rideable_type              0
started_at                 0
ended_at                   0
start_station_name    282068
start_station_id      282694
end_station_name      315109
end_station_id        315570
start_lat                  0
start_lng                  0
end_lat                 5286
end_lng                 5286
member_casual              0
ride_length                0
day_of_week                0
dtype: int64
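
Raw counts are hard to judge against 4,460,151 rows; the largest one, `end_station_id` with 315,570 missing values, is about 7.1% of the data. A minimal sketch of expressing missing values as percentages, using a hypothetical toy frame (`toy`) in place of `big_frame`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; on the real data this would be big_frame
toy = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3", "D4"],
    "end_station_name": ["Clark St & Schiller St", None, "New St & Illinois St", None],
    "end_lat": [41.907993, np.nan, 41.890847, 41.977997],
})

# isnull().mean() is the fraction of missing values per column
missing_pct = (toy.isnull().mean() * 100).round(2)
print(missing_pct)
```

Seeing the share per column helps decide whether dropping every row with any missing value is acceptable, or whether only some columns should be required.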

In [12]:
big_frame.duplicated()

0          False
1          False
2          False
3          False
4          False
           ...  
1048570    False
1048571    False
1048572    False
1048573    False
1048574    False
Length: 1048575, dtype: bool
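
The boolean Series above shows only the first and last rows, so it does not reveal whether any duplicates exist. A minimal sketch of counting them instead, assuming `ride_id` is meant to identify a ride uniquely; the toy frame is hypothetical:

```python
import pandas as pd

# Hypothetical toy frame with one repeated ride
toy = pd.DataFrame({
    "ride_id": ["A1", "B2", "B2", "C3"],
    "member_casual": ["member", "casual", "casual", "member"],
})

# Rows that repeat an earlier row in every column
n_full_dupes = int(toy.duplicated().sum())

# Rows that repeat an earlier ride_id, even if other columns differ
n_id_dupes = int(toy.duplicated(subset="ride_id").sum())

# Keep the first occurrence of each ride_id
deduped = toy.drop_duplicates(subset="ride_id", keep="first")
print(n_full_dupes, n_id_dupes, len(deduped))
```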

In [23]:
# drop_duplicates() returns a new DataFrame, so assign the result back
big_frame = big_frame.drop_duplicates()
big_frame

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,day_of_week
0,762198876D69004D,docked_bike,7/9/2020 15:22,7/9/2020 15:25,Ritchie Ct & Banks St,180.0,Wells St & Evergreen Ave,291.0,41.906866,-87.626217,41.906724,-87.634830,member,0.002083,5
1,BEC9C9FBA0D4CF1B,docked_bike,7/24/2020 23:56,7/25/2020 0:20,Halsted St & Roscoe St,299.0,Broadway & Ridge Ave,461.0,41.943670,-87.648950,41.984045,-87.660274,member,0.016667,6
2,D2FD8EA432C77EC1,docked_bike,7/8/2020 19:49,7/8/2020 19:56,Lake Shore Dr & Diversey Pkwy,329.0,Clark St & Wellington Ave,156.0,41.932588,-87.636427,41.936497,-87.647539,casual,0.004861,4
3,54AE594E20B35881,docked_bike,7/17/2020 19:06,7/17/2020 19:27,LaSalle St & Illinois St,181.0,Clark St & Armitage Ave,94.0,41.890762,-87.631697,41.918306,-87.636282,casual,0.014583,6
4,54025FDC7440B56F,docked_bike,7/4/2020 10:39,7/4/2020 10:45,Lake Shore Dr & North Blvd,268.0,Clark St & Schiller St,301.0,41.911722,-87.626804,41.907993,-87.631501,member,0.004167,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4460146,9397BDD14798A1BA,docked_bike,3/20/2021 14:58,3/20/2021 17:22,Michigan Ave & Oak St,13042,New St & Illinois St,TA1306000013,41.900960,-87.623777,41.890847,-87.618617,casual,0.099896,7
4460147,BBBEB8D51AAD40DA,classic_bike,3/2/2021 11:35,3/2/2021 11:43,Kingsbury St & Kinzie St,KA1503000043,New St & Illinois St,TA1306000013,41.889177,-87.638506,41.890847,-87.618617,member,0.005868,3
4460148,637FF754DA0BD9E1,classic_bike,3/9/2021 11:07,3/9/2021 11:49,Michigan Ave & Oak St,13042,Clark St & Berwyn Ave,KA1504000146,41.900960,-87.623777,41.977997,-87.668047,member,0.028877,3
4460149,F8F43A0B978A7A35,classic_bike,3/1/2021 18:11,3/1/2021 18:18,Kingsbury St & Kinzie St,KA1503000043,New St & Illinois St,TA1306000013,41.889177,-87.638506,41.890847,-87.618617,member,0.004630,2


In [13]:
big_frame=big_frame.dropna(axis=0,how="any")    # Delete the rows that contain missing values 
big_frame.head()

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,day_of_week,day_of_week.1,Avg Ride length,max ride_length
0,762198876D69004D,docked_bike,7/9/2020 15:22,7/9/2020 15:25,Ritchie Ct & Banks St,180.0,Wells St & Evergreen Ave,291.0,41.906866,-87.626217,41.906724,-87.63483,member,0.002083,5,Thursday,0.0222,37.445556


In [24]:
# mean, min and max ride_length for casual riders and members
big_frame.groupby('member_casual').ride_length.agg(['mean','min','max']) 

member_casual,mean,min,max
casual,0.025545,-20.136053,38.850104
member,0.007978,-20.173588,23.209282
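
The sample rows suggest that `ride_length` is stored as a fraction of a day: the first ride lasts 3 minutes, and 3 / 1440 ≈ 0.002083. The negative minimums then point to rides whose end time precedes their start time. A minimal sketch, assuming that unit and assuming non-positive durations are errors to exclude. The column `ride_minutes` and the last toy row (end before start) are hypothetical:

```python
import pandas as pd

# First three rows use timestamps from the sample output;
# the last row is a hypothetical bad record (ends before it starts)
toy = pd.DataFrame({
    "member_casual": ["member", "casual", "casual", "member"],
    "started_at": ["7/9/2020 15:22", "7/8/2020 19:49", "7/17/2020 19:06", "7/4/2020 10:45"],
    "ended_at": ["7/9/2020 15:25", "7/8/2020 19:56", "7/17/2020 19:27", "7/4/2020 10:39"],
})

fmt = "%m/%d/%Y %H:%M"
start = pd.to_datetime(toy["started_at"], format=fmt)
end = pd.to_datetime(toy["ended_at"], format=fmt)

# Recompute the duration directly from the timestamps, in minutes
toy["ride_minutes"] = (end - start).dt.total_seconds() / 60

# Keep only rides with a positive duration before summarizing
valid = toy[toy["ride_minutes"] > 0]
summary = valid.groupby("member_casual")["ride_minutes"].agg(["mean", "min", "max"])
print(summary)
```

Reporting minutes rather than day fractions also makes the comparison between casual riders and members easier to read.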


In [15]:
big_frame[["member_casual","day_of_week","ride_length","rideable_type"]] 

Unnamed: 0,member_casual,day_of_week,ride_length,rideable_type
0,member,5,0.002083,docked_bike


In [25]:
big_frame.groupby('member_casual')['day_of_week'].value_counts()

member_casual  day_of_week
casual         7              480691
               1              391537
               6              321675
               5              254546
               4              251795
               3              237183
               2              235533
member         4              353063
               7              335923
               6              334570
               3              331922
               5              321875
               2              311362
               1              298476
Name: day_of_week, dtype: int64
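
The counts are easier to compare as shares of each rider type's total. A minimal sketch, assuming `day_of_week` runs from 1 = Sunday to 7 = Saturday, which is consistent with the sample ride on 7/9/2020 (a Thursday) being coded as 5. The numbers are copied from the output above; the names `day_names`, `counts`, and `share` are illustrative:

```python
import pandas as pd

# Assumed mapping: 1 = Sunday ... 7 = Saturday
day_names = {1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
             5: "Thursday", 6: "Friday", 7: "Saturday"}

# Ride counts per day of week, copied from the value_counts() output
counts = pd.DataFrame({
    "casual": [391537, 235533, 237183, 251795, 254546, 321675, 480691],
    "member": [298476, 311362, 331922, 353063, 321875, 334570, 335923],
}, index=range(1, 8))
counts.index = counts.index.map(day_names)

# Percentage of each rider type's rides that fall on each day
share = (counts / counts.sum() * 100).round(1)
print(share)
```

With these figures, Saturday and Sunday together account for roughly 40% of casual rides, compared with about 28% of member rides.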

In [16]:
big_frame[big_frame["ride_length"]>1]

Unnamed: 0,ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual,ride_length,day_of_week,day_of_week.1,Avg Ride length,max ride_length


In [None]:
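import matplotlib.pyplot as plt

# Count rides per day of week for each rider type, then plot grouped bars
counts = big_frame.groupby(['day_of_week', 'member_casual']).size().unstack()
counts.plot(kind='bar')
plt.xlabel('day_of_week')
plt.ylabel('number of rides')
plt.show()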


# Recommendations

•	Personalize discounts and highlight membership perks based on each rider's preferences and riding habits.

•	Emphasize the benefits of membership, including discounts during busy periods such as summer and weekends.

•	Invite existing members to share stories about how using Cyclistic has changed their lives, and offer them a discount for doing so. This creates a sense of community and helps encourage new riders to join the program.
