# Bike Share Analysis in Python
I learned a lot working on the bike share case study for my Google Data Analytics certificate. I got to apply the skills I learned from the course and experienced firsthand how data can impact decisions. Working on the project gave me a strong foundation in R, including how to wrangle, clean, and visualize data.
Python is another programming language widely used to analyze data, so I am redoing the same project to demonstrate my ability with Python and to better understand the differences between R and Python.

## Background
A little background on the scenario and data:

Cyclistic is a bike-share company in Chicago with a fleet of 5,824 geotracked bicycles and a network of 692 stations. They offer 3 pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders and customers who purchase annual memberships are referred to as Cyclistic members. The company determined that annual members are more profitable than casual riders and is aiming to create a marketing campaign to maximize the number of annual members by converting casual riders.

**Key stakeholder**: Lily Moreno, the director of marketing responsible for developing campaigns to promote the bike-share program.

**Business task**: How do annual members and casual riders use Cyclistic bikes differently?

***

In [None]:
# Set up environment and import libraries
from matplotlib import pyplot as plt 
import pandas as pd
import seaborn as sns
import numpy as np
import glob
from pandas.api.types import CategoricalDtype

# Full output for each cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Import Data
Proxy data from a similar bike sharing company called Divvy was used. The data can be downloaded at <a href="https://divvy-tripdata.s3.amazonaws.com/index.html" target="_blank">https://divvy-tripdata.s3.amazonaws.com/index.html</a>
12 months of data (12 CSV files) were downloaded and merged into one dataframe.
The dataframe was previewed to confirm the merge was successful and to examine column headers and data types.

In [None]:
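# Read and merge CSV files to dataframe
# sorted() makes the file order reproducible; ignore_index=True avoids repeated row labels
df = pd.concat(map(pd.read_csv, sorted(glob.glob("data/*.csv"))), ignore_index = True)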
# Preview data
df
df.info()
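
The merge step can be tried on a small scale before pointing it at the full 12 months. The sketch below is illustrative only: `merge_csv_files` is a hypothetical helper, and the two tiny CSV files are made-up stand-ins written to a temporary folder, not real Divvy data. It sorts the file list because `glob.glob` does not promise any particular order, and it resets the index so that row labels are not repeated across files.

```python
import glob
import os
import tempfile

import pandas as pd


def merge_csv_files(pattern):
    """Read every CSV matching `pattern` and stack them into one dataframe.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    paths = sorted(glob.glob(pattern))  # sort so the file order is reproducible
    if not paths:
        raise FileNotFoundError(f"No files match {pattern!r}")
    # ignore_index=True gives the merged dataframe a fresh 0..n-1 index
    return pd.concat((pd.read_csv(path) for path in paths), ignore_index=True)


# Toy demonstration with two made-up monthly files
with tempfile.TemporaryDirectory() as tmp:
    january = pd.DataFrame({"ride_id": ["a1", "a2"],
                            "member_casual": ["member", "casual"]})
    february = pd.DataFrame({"ride_id": ["b1"],
                             "member_casual": ["casual"]})
    january.to_csv(os.path.join(tmp, "2021-01.csv"), index=False)
    february.to_csv(os.path.join(tmp, "2021-02.csv"), index=False)

    merged = merge_csv_files(os.path.join(tmp, "*.csv"))

print(merged.shape)  # rows from both files, same columns
```

Comparing the merged row count with the sum of the row counts of the individual files is a quick way to confirm that nothing was lost in the merge.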

## Prepare Data
`started_at` and `ended_at` columns were converted to datetime format to allow for easy creation of year, month, day, and day of the week columns. `day_of_week` was converted to a categorical data type to maintain the order of the days of the week. To determine the ride length, `started_at` was subtracted from `ended_at`, which produces a timedelta. This was converted to seconds (a float data type) to allow for comparison with numbers later and for the creation of plots.

In [None]:
# Convert started_at and ended_at columns to DateTime format
df['started_at'] = pd.to_datetime(df['started_at'], format = '%Y-%m-%d %H:%M:%S')
df['ended_at'] = pd.to_datetime(df['ended_at'], format = '%Y-%m-%d %H:%M:%S')

# Add columns for year, month, day, and day of the week 
df['year'] = df['started_at'].dt.year
df['month'] = df['started_at'].dt.month
df['day'] = df['started_at'].dt.day
df['day_of_week'] = df['started_at'].dt.day_name()

# Order days of the week
cats = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
cat_type = CategoricalDtype(categories = cats, ordered = True)
df['day_of_week'] = df['day_of_week'].astype(cat_type)

# Add column for ride length
df['ride_length'] = df['ended_at'] - df['started_at']
# Convert the timedelta to seconds (float)
df['ride_length'] = df['ride_length'].dt.total_seconds()
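
To see what these conversions do, here is a minimal sketch on a two-row toy dataframe. The timestamps are invented for illustration and `toy` and `day_type` are placeholder names. It shows that an ordered categorical sorts by the category list rather than alphabetically, and that subtracting two datetime columns gives a timedelta that `.dt.total_seconds()` turns into a float.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Invented timestamps: 2021-06-04 is a Friday, 2021-06-06 is a Sunday
toy = pd.DataFrame({
    "started_at": ["2021-06-04 08:00:00", "2021-06-06 17:30:00"],
    "ended_at": ["2021-06-04 08:25:30", "2021-06-06 17:42:00"],
})

# Parse the text columns into datetime64 values
for col in ["started_at", "ended_at"]:
    toy[col] = pd.to_datetime(toy[col], format="%Y-%m-%d %H:%M:%S")

# Ordered categorical: Sunday is the first day, Saturday the last
days = ["Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"]
day_type = CategoricalDtype(categories=days, ordered=True)
toy["day_of_week"] = toy["started_at"].dt.day_name().astype(day_type)

# datetime - datetime gives a timedelta; convert it to seconds as a float
toy["ride_length"] = (toy["ended_at"] - toy["started_at"]).dt.total_seconds()

# Sorting follows the category order, so Sunday comes before Friday
print(toy.sort_values("day_of_week")[["day_of_week", "ride_length"]])
```

Without the categorical conversion, the same sort would be alphabetical and put Friday first, which is why plots grouped by day would otherwise come out in a confusing order.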

## Clean Data
The data was examined to confirm each row is unique (no duplicates) and within the specified date range. Values in the `member_casual` and `rideable_type` columns were evaluated to make sure there was nothing unexpected and `ride_length` was checked for negative values. 

Rows missing a start or end station name were dropped, along with rows whose `rideable_type` is `docked_bike` and rows with a ride length of zero or less. The latitude and longitude columns were dropped because they are not used in the plots that follow.

In [None]:
# Check for duplicate rows
df_dupes = df[df.duplicated(['ride_id'])]
print(df_dupes)
# Check data is within date range
df['started_at'].max()
df['started_at'].min()
# Check for inconsistent data (i.e. more than 2 member types)
df['member_casual'].value_counts()
df['rideable_type'].value_counts()
# Check for negative ride durations
negative_ride_length = df[(df['ride_length'] < 0)]
print(negative_ride_length)

In [None]:
# Remove irrelevant columns and rows with missing station names
df.drop(['start_lat', 'start_lng', 'end_lat', 'end_lng'], axis = 1, inplace = True)
# dropna returns a new dataframe, so the result must be assigned back
df = df.dropna(subset = ['start_station_name', 'end_station_name'])

# Drop rows with docked_bike or a zero/negative ride length
# (ride_length is already a float in seconds, so no .dt accessor is needed)
df = df[(df['rideable_type'] != 'docked_bike') & (df['ride_length'] > 0)]
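
The cleaning rules can be checked on a handful of rows where the expected result is easy to work out by hand. This is a sketch with made-up rows, not the real data; `toy`, `n_dupes`, `keep`, and `cleaned` are placeholder names. It assumes the same column names as the merged dataframe and a `ride_length` column that already holds seconds as a float.

```python
import pandas as pd

# Six made-up rides: one duplicate id, one docked bike,
# one missing station name, and a negative ride length
toy = pd.DataFrame({
    "ride_id": ["r1", "r2", "r3", "r4", "r4", "r5"],
    "rideable_type": ["classic_bike", "docked_bike", "electric_bike",
                      "classic_bike", "classic_bike", "electric_bike"],
    "start_station_name": ["A", "B", None, "C", "C", "D"],
    "end_station_name": ["B", "C", "A", "D", "D", "A"],
    "ride_length": [600.0, 900.0, 300.0, -15.0, -15.0, 1200.0],
})

# Count rows whose ride_id repeats an earlier row
n_dupes = toy.duplicated(["ride_id"]).sum()

# dropna returns a new dataframe, so keep the result
cleaned = toy.dropna(subset=["start_station_name", "end_station_name"])

# Build the filter as a named boolean mask so each rule is easy to read
keep = (cleaned["rideable_type"] != "docked_bike") & (cleaned["ride_length"] > 0)
cleaned = cleaned[keep]

print(n_dupes)
print(cleaned["ride_id"].tolist())
```

Only `r1` and `r5` pass every rule here. Printing `len(df)` before and after each step on the real data gives a similar sanity check on how many rows each rule removes.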

In [None]:
df['ride_length'].describe()

In [None]:
# Group by day of week and member type, then aggregate average ride length
# observed=True is passed explicitly because day_of_week is categorical
df1 = df.groupby(['day_of_week', 'member_casual'], observed = True)['ride_length'].mean()
print(df1)
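
Because the business task is a comparison between the two rider types, it can help to reshape the grouped result so that casual riders and members sit side by side. The sketch below uses invented numbers and placeholder names (`toy`, `summary_seconds`, `summary_minutes`); it is one possible way to present the result, not a step taken in the original analysis.

```python
import pandas as pd

# Invented rides; ride_length is in seconds
toy = pd.DataFrame({
    "day_of_week": ["Saturday", "Saturday", "Saturday",
                    "Monday", "Monday", "Monday"],
    "member_casual": ["casual", "casual", "member",
                      "casual", "member", "member"],
    "ride_length": [1800.0, 2400.0, 900.0, 1200.0, 600.0, 800.0],
})

# Mean ride length per (day, rider type), then pivot rider type into columns
summary_seconds = (
    toy.groupby(["day_of_week", "member_casual"])["ride_length"]
    .mean()
    .unstack("member_casual")
)

# Minutes are easier to read than seconds in a summary table
summary_minutes = summary_seconds / 60

print(summary_minutes)
```

Each row of the reshaped table is a day and each column a rider type, so differences between the two groups can be read across a single row.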

In [None]:
# Plot average ride length vs. day of the week
plt.figure(figsize=[15, 14])
sns.barplot(data = df, x = 'day_of_week', y = 'ride_length', hue = 'member_casual')
plt.title('Average Ride Length vs. Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Average Ride Length (seconds)')

In [None]:
# Plot number of rides vs. day of the week
plt.figure(figsize=[15, 14])
sns.countplot(data = df, x = 'day_of_week', hue = 'member_casual')
plt.title('Number of Rides vs. Day of the Week')
plt.xlabel('Day of the Week')
plt.ylabel('Number of Rides')