# 2022 CoGo Ride Exploration and Visualization
## by Fisayo Sofuwa

## Introduction
The [CoGo](https://en.wikipedia.org/wiki/CoGo) Bike Share system launched in July 2013 with a network of 300 bicycles and 30 stations located throughout downtown Columbus. Today, CoGo boasts around 90 stations and 600 bikes serving Columbus, Bexley, Upper Arlington, Grandview Heights and Easton. The system provides Columbus residents and visitors an additional transportation option for getting around town that is fun, easy, and affordable.

CoGo is available for use 24 hours a day, 365 days a year and includes both classic pedal bikes and electric assist ebikes. The station network provides twice as many docking points as bicycles, assuring that an available dock to return your bicycle is always nearby. The trip data is available [here](https://cogo-sys-data.s3.amazonaws.com/index.html).

>**N.B**: The data used for this exploration is data from **`January`** to **`August`**.

## Preliminary Wrangling


In [2]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import glob
import os

%matplotlib inline
sns.set_style('darkgrid')
matplotlib.rcParams['figure.figsize'] = (8,6) # Adjust the configuration of the plots we will create.


## Gathering and Assessing Data

In [None]:
# Get monthly CSV trip files from a folder and append data together
folder_name = 'data'

# Skip the merge if the combined file already exists;
# otherwise read the monthly CSV files and append the data together
if not os.path.exists("cogo_trips_2022.csv"):
    csv_files = glob.glob(os.path.join(folder_name, "*.csv"))
    # Read each CSV file into DataFrame
    # This creates a list of dataframes
    trip_list = [pd.read_csv(file) for file in csv_files]
    # Concatenate all DataFrames
    full_trip = pd.concat(trip_list, ignore_index=True)
    
    print(full_trip.shape)
    full_trip.head()
    
    # save the appended data to a .csv file for further usage
    full_trip.to_csv('cogo_trips_2022.csv', index=False)

In [None]:
biketrips22 = pd.read_csv('cogo_trips_2022.csv')
print(biketrips22.shape)
biketrips22.head()

In [None]:
biketrips22.info()

In [None]:
# check for null values
biketrips22.isna().sum()

In [None]:
biketrips22.duplicated().sum()

There are no duplicates in the data

In [None]:
biketrips22.rideable_type.value_counts()

In [None]:
biketrips22.member_casual.value_counts()

## Issues

1. `started_at` and `ended_at` are not stored in datetime format.
2. Missing values in `start_station_id`, `start_station_name`, `end_station_id`, `end_station_name`, `end_lat` and `end_lng`.
3. Drop unnecessary columns (`ride_id`, `start_station_id`, `end_station_id`).
4. Create a trip duration column from `started_at` and `ended_at`.
5. Create trip start date, trip start hour of the day, day of the week and month columns.
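
Before dropping rows for issue 2, it can help to see what share of each column is missing and how many rows would survive. The following is a minimal sketch on a made-up frame: the values are invented, and only the column names are borrowed from the trip data.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the trip data.
toy = pd.DataFrame({
    "start_station_name": ["A", "B", None, "D"],
    "end_station_name": ["B", None, None, "A"],
    "end_lat": [39.96, np.nan, np.nan, 39.97],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Rows that survive dropna(), i.e. rows with no missing value in any column.
rows_kept = len(toy.dropna())

print(missing_share)
print(rows_kept)
```

Running the same two expressions on `biketrips22` would show how much data a plain `dropna()` removes.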

## Cleaning Data

In [None]:
# make a copy of the dataframe
trips22 = biketrips22.copy()

In [None]:
# issue 1: Change started_at and ended_at to datetime format.
trips22['started_at'] = pd.to_datetime(trips22['started_at'])
trips22['ended_at'] = pd.to_datetime(trips22['ended_at'])

trips22.info()

In [None]:
# issue 2: drop every row that has a missing value in any column (station ids and names, end_lat and end_lng).
trips22 = trips22.dropna(axis=0)
trips22.info()

In [None]:
# issue 3: Drop unnecessary columns(ride_id, start_station_id, end_station_id)
trips22 = trips22.drop(columns = ['ride_id', 'start_station_id', 'end_station_id'])
trips22.columns

In [None]:
# issue 4: Create a trip duration column (total minutes) from started_at and ended_at.
# total_seconds() keeps the hours and days that dt.components.minutes would discard
trips22['duration_minute'] = (trips22['ended_at'] - trips22['started_at']).dt.total_seconds() / 60
trips22.sample(5)
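
A timedelta can be turned into minutes in two ways that only agree for trips shorter than an hour. The sketch below uses two made-up trips (not CoGo data) to show the difference between the minutes component and the total elapsed minutes.

```python
import pandas as pd

# Two invented trips: one of 12 minutes and one of 1 hour 5 minutes.
toy = pd.DataFrame({
    "started_at": pd.to_datetime(["2022-06-01 08:00:00", "2022-06-01 09:00:00"]),
    "ended_at": pd.to_datetime(["2022-06-01 08:12:00", "2022-06-01 10:05:00"]),
})
delta = toy["ended_at"] - toy["started_at"]

# Minutes component only: the hours (and days) part is discarded.
minutes_component = delta.dt.components.minutes.tolist()

# Total elapsed minutes, hours and days included.
total_minutes = (delta.dt.total_seconds() / 60).tolist()

print(minutes_component)
print(total_minutes)
```

For the second trip the two values differ, which is why the total is the safer basis for a duration column.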

In [None]:
# issue 5: Create trip start date, trip start hour of the day, day of the week and month
trips22['start_date'] = trips22['started_at'].dt.date
trips22['start_hourofday'] = trips22['started_at'].dt.hour
trips22['start_dayofweek'] = trips22['started_at'].dt.day_name()
trips22['start_month'] = trips22['started_at'].dt.month_name()

trips22.sample(3)

### What is the structure of your dataset?

> The original combined data covers January to August and contains approximately 35,800 individual trip records with 13 variables. The variables can be divided into 3 categories:
> * trip duration: `started_at`, and `ended_at`.
> * station info: `start_station_name`, `start_station_id`, `end_station_name`, `end_station_id`, `start_lat`, `start_lng`,  `end_lat`, `end_lng`.
> * member info: `member_casual`, `rideable_type`, and `ride_id`

> Derived features/variables to assist exploration and analysis:
> * trip info: `duration_minute`, `start_date`, `start_hourofday`, `start_dayofweek`, `start_month`.


### What is/are the main feature(s) of interest in your dataset?

> I'm interested in exploring the bike trips' duration and rental event occurrence patterns, along with how these relate to the riders' characteristics, i.e. user type (member or casual), to get a sense of how and what people are using the bike sharing service for. Sample questions to answer: When are most trips taken in terms of time of day, day of week, or month of the year? How long does the average trip take? Do the answers depend on whether a user is a casual or a member?

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

> Each trip's start date/time and duration will help us understand how long a trip usually takes and when it happens. The user type (casual or member) will show which group uses the service more, and summarizing bike usage by group will reveal whether any special pattern is associated with a specific group of riders.

## Univariate Exploration



A series of plots to first explore the trips distribution over hour of the day, day of the week, and month of the year.

In [None]:
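def label(x, y, t=None):
    """
    Label the current plot and display it.

    Args:
    x: x-axis title
    y: y-axis title
    t: main title (optional)
    """
    plt.xlabel(x)
    plt.ylabel(y)
    # the title is optional so the two-argument calls below work
    if t is not None:
        plt.title(t)
    plt.show()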

In [None]:
# trip distribution over day hours

base_color = sns.color_palette()[0]
sns.countplot(data=trips22, x='start_hourofday', color=base_color)
label('Trip Start Hour of Day', 'Count');

In [None]:
# trip distribution over weekdays
# change the start_dayofweek to categorical datatype
weekday = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekdaycat = pd.api.types.CategoricalDtype(ordered=True, categories=weekday)
trips22['start_dayofweek'] = trips22['start_dayofweek'].astype(weekdaycat)

sns.countplot(data=trips22, x='start_dayofweek', color=base_color)
label('Trip Start Day of Week', 'Count');

In [None]:
# trip distribution over months
# change the start_month to categorical datatype
month = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August']
monthcat = pd.api.types.CategoricalDtype(ordered=True, categories=month)
trips22['start_month'] = trips22['start_month'].astype(monthcat)

sns.countplot(data=trips22, y='start_month', color=base_color)
label('Count', 'Trip Start Month');

The trip distribution over hours of the day is unimodal and peaks from late afternoon to evening, around 3pm-7pm, during the typical evening rush hours. The distribution over days of the week shows only slight differences between weekdays, with Saturday having the highest usage. For 2022, between January and August, usage increases month over month up to June, which has the highest usage, and then decreases in July and August.
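
The counts behind these bar charts can also be tabulated directly. The sketch below uses a handful of made-up start times (not CoGo data) and an ordered categorical so that the days appear in calendar order rather than alphabetically.

```python
import pandas as pd

# Invented start times: two on a Saturday, one on a Monday, one on a Wednesday.
started = pd.Series(pd.to_datetime([
    "2022-06-04 15:10", "2022-06-04 17:45", "2022-06-06 17:05", "2022-06-08 08:30",
]))

weekday = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
day_names = pd.Series(
    pd.Categorical(started.dt.day_name(), categories=weekday, ordered=True)
)

# Trips per day of week, sorted by category order (Monday first);
# days with no trips are kept with a count of zero.
trips_per_day = day_names.value_counts().sort_index()

# Trips per start hour, sorted by hour.
trips_per_hour = started.dt.hour.value_counts().sort_index()

print(trips_per_day)
print(trips_per_hour)
```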


The next several plots look at user type (`member_casual`) and ride type (`rideable_type`) to get a sense of the typical user of the service.

In [None]:
sns.countplot(data=trips22, x='member_casual', color=base_color)

label('User Type','Count');

In [None]:
sns.countplot(data=trips22, x='rideable_type', color=base_color)

label('Ride Type','Count');

Most riders are casual users, and most rides are on classic bikes, with little docked bike usage.

Next is the trip duration distribution.

In [None]:
# one-minute bins up to one hour, where the bulk of the trips lie
bins = np.arange(0, 61, 1)
plt.hist(data=trips22, x='duration_minute', bins=bins, color=base_color)
plt.xlabel('Trip Duration in Minutes')
plt.ylabel('Count')
ticks = np.arange(0, 61, 5)
plt.xticks(ticks, ticks);

The trip duration distribution is right-skewed: the majority of trips last less than 1 hour, and most of them fall between 4 and 13 minutes.
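
A range such as "4 to 13 minutes" can be backed by quartiles, and right skew by comparing the mean with the median. The sketch below assumes the range is read as the middle half of trips (an interpretation, not something the notebook states) and uses made-up durations rather than CoGo data.

```python
import pandas as pd

# Invented trip durations in minutes, with a long right tail.
durations = pd.Series([3, 4, 5, 6, 8, 9, 11, 13, 20, 55])

# 25th, 50th and 75th percentiles; the middle half of trips lies between
# the first and the last of these values.
quartiles = durations.quantile([0.25, 0.5, 0.75])

# In a right-skewed distribution the long tail pulls the mean above the median.
right_skewed = durations.mean() > durations.median()

print(quartiles)
print(right_skewed)
```

Applied to `trips22['duration_minute']`, the same two expressions give numbers that can be quoted next to the histogram.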

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> The number of trips peaks around 3pm-7pm, with Saturday being higher than other days. For 2022, between January and August, usage increases with each month, except for nearly equal usage in January and February and a fall in July and August.

> Most riders are casual users and most rides are on classic bikes.

> Most rides were quick and short, lasting between 4 and 13 minutes.

> There were no unusual points, and no transformation was needed because the data was straightforward.




### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> The trip duration was derived from the `started_at` and `ended_at` features. The `start_date`, `start_hourofday`, `start_dayofweek` and `start_month` features were also derived from `started_at`.

## Bivariate Exploration

How does the trip duration distribution vary between members and casual riders, and between electric, classic and docked bikes?

In [None]:
sns.violinplot(data=trips22, x='member_casual', y='duration_minute', color=base_color, inner='quartile')
label('User Type', 'Trip Duration in Minute');

The trip duration distribution is much narrower for members than for casual riders, whose distribution extends further toward longer trips. It seems that members have a more specific usage or targeted goal when riding, whereas casual riders rent the bikes for longer.

How does the trip duration distribution vary by ride type?

In [None]:
sns.boxplot(data=trips22, x='rideable_type', y='duration_minute', color=base_color)
label('Ride Type', 'Trip Duration in Minute');

There is only a slight difference between electric and classic bikes, while docked bikes have longer trip durations.

Average Trip Duration by Hours of the Day.

In [None]:
sns.barplot(data=trips22, x='start_hourofday', y='duration_minute', color=base_color)
label('Hour of Day', 'Avg. Trip Duration in Minute');

Trips are much shorter between 4am and 9am, with the longest average duration around 2pm.

Average Trip Duration on Weekdays

In [None]:
sns.barplot(data=trips22, x='start_dayofweek', y='duration_minute', color=base_color)
label('Day of week', 'Avg. Trip Duration in Minute');

Trips are much shorter on Monday through Friday than on weekends. This indicates stable and efficient usage of the sharing system on normal work days, and more casual, flexible use on weekends.

Average Trip Duration on Month

In [None]:
sns.barplot(data=trips22, x='start_month', y='duration_minute', color=base_color)
label('Month', 'Avg. Trip Duration in Minute');

The average trip duration shows an increasing trend from January to May, is almost equal in June and July, and decreases in August.

Weekly usage between members and casuals

In [None]:
sns.countplot(data=trips22, x='start_dayofweek', hue='member_casual')
label('Day of Week', 'Count');

The gap between members and casual riders is much larger on weekends than on weekdays. This suggests that both members and casual riders use the bike sharing system to commute to work on weekdays, while casual riders account for the high usage on weekends.
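
The weekday and weekend comparison can be put into numbers with a cross-tabulation and a weekend share per user type. This is a minimal sketch on made-up rows (not CoGo data); only the column names come from the notebook.

```python
import pandas as pd

# Invented trips; only the column names mirror the notebook.
toy = pd.DataFrame({
    "start_dayofweek": ["Monday", "Monday", "Saturday", "Saturday", "Saturday", "Sunday"],
    "member_casual": ["member", "casual", "casual", "casual", "member", "casual"],
})

# Trip counts per day of week and user type (the table behind a hue countplot).
counts = pd.crosstab(toy["start_dayofweek"], toy["member_casual"])

# Share of each user type's trips that start on a weekend.
is_weekend = toy["start_dayofweek"].isin(["Saturday", "Sunday"])
weekend_share = is_weekend.groupby(toy["member_casual"]).mean()

print(counts)
print(weekend_share)
```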

Monthly usage between members and casuals

In [None]:
sns.countplot(data=trips22, x='start_month', hue='member_casual')
label('Month', 'Count');

The plot above shows trips increasing over the months, with members making more trips than casual riders, except in January and February, where the difference between the two is very slight.

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> There are a lot more members than casual users, and riding habits vary a lot between the two user types. Both members and casual riders use the bike sharing system for work commutes, hence the low average trip duration on work days (Mon-Fri), whereas casual riders tend to ride for fun, especially over the weekend.


### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> It is interesting to see that docked bikes have a higher average trip duration, with little difference between electric and classic bikes.

## Multivariate Exploration

How does the average trip duration vary by day of the week for each user type and ride type?

In [None]:
sns.pointplot(data=trips22, x='start_dayofweek', y='duration_minute', hue='member_casual', dodge=0.3, linestyles='')
label('Day of Week', 'Avg. Trip Duration in Minute');

The plot above shows that members ride much shorter, quicker trips than casual riders on each day of the week. For both user types, trip duration trends downward from Monday to Thursday and upward from Friday to Sunday.
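
The means that a point plot draws can be tabulated with a pivot table, which makes the member and casual comparison easy to read off per day. The sketch below uses made-up durations (not CoGo data) and the notebook's column names.

```python
import pandas as pd

# Invented trips; only the column names mirror the notebook.
toy = pd.DataFrame({
    "start_dayofweek": ["Monday", "Monday", "Monday", "Saturday", "Saturday", "Saturday"],
    "member_casual": ["member", "member", "casual", "member", "casual", "casual"],
    "duration_minute": [8, 10, 14, 10, 30, 20],
})

# Mean trip duration for each (day of week, user type) pair.
mean_duration = toy.pivot_table(
    index="start_dayofweek",
    columns="member_casual",
    values="duration_minute",
    aggfunc="mean",
)

print(mean_duration)
```

Swapping `member_casual` for `rideable_type` gives the equivalent table for ride types.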

In [None]:
sns.pointplot(data=trips22, x='start_dayofweek', y='duration_minute', hue='rideable_type', dodge=0.3, linestyles='')
label('Day of Week', 'Avg. Trip Duration in Minute');

In the plot above, docked bikes have a high average trip duration compared with classic and electric bikes, and there is little difference between the latter two.

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> The multivariate exploration strengthened some of the patterns discovered in the earlier univariate and bivariate explorations, since the relationships between multiple variables are visualized together. The short, efficient usage by members from Monday through Friday indicates that they use the bikes primarily for work commutes. The longer and more flexible pattern of casual use shows that casual riders take advantage of the bike sharing system quite differently, riding heavily over the weekends, probably for city tours or leisure.

### Were there any interesting or surprising interactions between features?

> The interactions between features all supplement each other and make sense when looked at together. There is no big surprise, except for the high average trip duration of docked bikes compared to electric and classic bikes.