# Analyzing [Ford GoBike System](https://s3.amazonaws.com/fordgobike-data/index.html) Data

<span style="color: gray; font-size:1em;">Mateusz Zajac</span>
<br><span style="color: gray; font-size:1em;">Feb-2019</span>


## Table of Contents
- [Introduction](#intro)
- [Part I - Gathering Data](#gather)
- [Part II - Assessing Data](#assess)
- [Part III - Cleaning Data](#clean)
- [Part IV - Univariate Exploration](#univariate)
- [Part V - Bivariate Exploration](#bivariate)
- [Part VI - Multivariate Exploration](#multivariate)

<a id='intro'></a>
### Introduction

Ford GoBike is a regional public bicycle sharing system in the San Francisco Bay Area, California. Beginning operation in August 2013 as Bay Area Bike Share, the Ford GoBike system currently has over 2,600 bicycles in 262 stations across San Francisco, East Bay and San Jose. On June 28, 2017, the system officially launched as Ford GoBike in a partnership with Ford Motor Company.
<br>
<br>Ford GoBike, like other bike share systems, consists of a fleet of specially designed, sturdy and durable bikes that are locked into a network of docking stations throughout the city. The bikes can be unlocked from one station and returned to any other station in the system, making them ideal for one-way trips. The bikes are available for use 24 hours/day, 7 days/week, 365 days/year and riders have access to all bikes in the network when they become a member or purchase a pass.

### Preliminary Wrangling

This document explores Ford GoBike's public trip data, containing approximately 1,850,000 bike rides from FY2018.

<a id='gather'></a>
## Part I - Gathering Data

In [28]:
# Import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime

import glob
import os

%matplotlib inline

In [35]:
path = r'C:\Users\Ags91\Jupyter_Notebooks\Udacity\Project 4_-_Communicate_Data_Findings'
# use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    li.append(df)

frame = pd.concat(li, axis=0, ignore_index=True)

ValueError: No objects to concatenate

In [36]:
# Path to the folder where all files have been stored
path =r'...Udacity\Project 4_-_Communicate_Data_Findings'

In [37]:
# Store list of all file locations
all_files = glob.glob(os.path.join(path, "*.csv"))

In [38]:
# load and union the dataset 
df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index = True)

ValueError: No objects to concatenate

In [39]:
# Write newly loaded data 
df.to_csv('master_file.csv', index=False)

NameError: name 'df' is not defined

>**path to the folder where all files have been stored**
<br>path =r'...\Udacity Projects\project_4-communicate_data_findings\Ford_GoBike_System_Data_201801-201901'

>**store list of all file locations**
<br>all_files = glob.glob(os.path.join(path, "*.csv"))

>**load and union the dataset**
<br>df = pd.concat((pd.read_csv(f) for f in all_files), ignore_index = True)

>**write newly loaded data**
<br>df.to_csv('master_file.csv', index=False)
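
The `ValueError: No objects to concatenate` shown earlier is what `pd.concat` raises when it receives an empty list, which happens when `glob` matches no files in the given folder. The following is a minimal sketch of the same gather step with an explicit guard. The helper name `load_trip_csvs` and the two demo file names are hypothetical and only serve the illustration.

```python
import glob
import os
import tempfile

import pandas as pd


def load_trip_csvs(folder):
    """Read every CSV in `folder` and stack them into one DataFrame."""
    csv_paths = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if not csv_paths:
        # pd.concat raises "No objects to concatenate" on an empty list,
        # so fail early with a message that names the folder instead.
        raise FileNotFoundError(f"no CSV files found in {folder!r}")
    return pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)


# Demonstration on two tiny made-up monthly files in a temporary folder.
with tempfile.TemporaryDirectory() as demo_dir:
    pd.DataFrame({"duration_sec": [600, 900]}).to_csv(
        os.path.join(demo_dir, "201801.csv"), index=False)
    pd.DataFrame({"duration_sec": [300]}).to_csv(
        os.path.join(demo_dir, "201802.csv"), index=False)
    combined = load_trip_csvs(demo_dir)

print(combined.shape)
```

Sorting the paths keeps the row order reproducible, since `glob` does not guarantee any particular order.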

In [44]:
# load the dataset
df = pd.read_csv('combine.csv')

<a id='assess'></a>
## Part II - Assessing Data

In [43]:
# Visually check first 5 records
df.head()

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
0,75284,2018-01-31 22:52:35.2390,2018-02-01 19:47:19.8240,120,Mission Dolores Park,37.7614,-122.426,285,Webster St at O'Farrell St,37.7835,-122.431,2765,Subscriber,1986.0,Male,No
1,85422,2018-01-31 16:13:34.3510,2018-02-01 15:57:17.3100,15,San Francisco Ferry Building (Harry Bridges Pl...,37.7954,-122.394,15,San Francisco Ferry Building (Harry Bridges Pl...,37.7954,-122.394,2815,Customer,,,No
2,71576,2018-01-31 14:23:55.8890,2018-02-01 10:16:52.1160,304,Jackson St at 5th St,37.3488,-121.895,296,5th St at Virginia St,37.326,-121.877,3039,Customer,1996.0,Male,No
3,61076,2018-01-31 14:53:23.5620,2018-02-01 07:51:20.5000,75,Market St at Franklin St,37.7738,-122.421,47,4th St at Harrison St,37.781,-122.4,321,Customer,,,No
4,39966,2018-01-31 19:52:24.6670,2018-02-01 06:58:31.0530,74,Laguna St at Hayes St,37.7764,-122.426,19,Post St at Kearny St,37.789,-122.403,617,Subscriber,1991.0,Male,No


In [45]:
# Visually check 50 random records
df.sample(50)

Unnamed: 0,duration_sec,start_time,end_time,start_station_id,start_station_name,start_station_latitude,start_station_longitude,end_station_id,end_station_name,end_station_latitude,end_station_longitude,bike_id,user_type,member_birth_year,member_gender,bike_share_for_all_trip
66140,1060,2018-01-12 08:26:33.7700,2018-01-12 08:44:13.7910,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,9,Broadway at Battery St,37.79857210846256,-122.40086898207666,3152,Subscriber,,,No
569307,711,2018-05-10 08:26:21.4440,2018-05-10 08:38:13.2260,70,Central Ave at Fell St,37.7733,-122.444,43,San Francisco Public Library (Grove St at Hyde...,37.7788,-122.416,629,Subscriber,1990.0,Male,No
1332171,976,2018-09-12 08:36:09.1690,2018-09-12 08:52:25.6120,145,29th St at Church St,37.7437,-122.427,136,23rd St at San Bruno Ave,37.7544,-122.404,4336,Subscriber,1992.0,Male,No
1030138,789,2018-08-30 08:30:18.6520,2018-08-30 08:43:28.4860,30,San Francisco Caltrain (Townsend St at 4th St),37.776598,-122.395282,19,Post St at Kearny St,37.788975,-122.403452,2027,Subscriber,1986.0,Female,No
642701,343,2018-06-28 07:11:44.3900,2018-06-28 07:17:28.2760,317,San Salvador St at 9th St,37.333955,-121.877349,311,Paseo De San Antonio at 2nd St,37.333798,-121.886943,1824,Subscriber,1988.0,Female,Yes
241722,640,2018-03-22 09:02:05.0740,2018-03-22 09:12:45.3900,59,S Van Ness Ave at Market St,37.7748,-122.419,90,Townsend St at 7th St,37.7711,-122.403,1847,Subscriber,1985.0,Male,No
248478,216,2018-03-20 08:56:40.4290,2018-03-20 09:00:17.0200,108,16th St Mission BART,37.7647,-122.42,112,Harrison St at 17th St,37.7638,-122.413,617,Subscriber,1983.0,Male,No
716692,355,2018-06-17 10:19:26.7700,2018-06-17 10:25:22.6460,147,29th St at Tiffany Ave,37.7441,-122.421,133,Valencia St at 22nd St,37.7552,-122.421,120,Subscriber,1978.0,Male,No
1274019,651,2018-09-20 18:05:47.3230,2018-09-20 18:16:38.8980,5,Powell St BART Station (Market St at 5th St),37.7839,-122.408,8,The Embarcadero at Vallejo St,37.8,-122.399,3142,Subscriber,1994.0,Male,No
1832778,995,2018-12-07 15:34:41.8480,2018-12-07 15:51:17.3520,67,San Francisco Caltrain Station 2 (Townsend St...,37.7766,-122.396,58,Market St at 10th St,37.7766,-122.417,789,Subscriber,2000.0,Male,No


In [None]:
# View info of the dataframe
# (show_counts replaces the null_counts argument, which was removed in pandas 2.0)
df.info(verbose=True, show_counts=True)

In [None]:
# Check if duplicates exist
df.duplicated().sum()

In [None]:
# View descriptive statistics of the dataframe
df.describe()

<a id='issues'></a>
**Quality issues**
 * start time and end time are stored as objects, not timestamps
 * user type, gender and bike_share_for_all_trip can be set to category
 * bike id, start_station_id, end_station_id can be set to object
 * member birth year contains values prior to 1900
 * we can calculate the age of the user
 * we can further enhance the dataset with more details about the time like month, day, hour, weekday
 * we can calculate the distance for rides between stations
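
The last item, the distance between stations, is not implemented in this notebook. One possible approach is the haversine formula, which gives the straight-line (great-circle) distance between two latitude/longitude pairs. This is only a sketch: the function name `haversine_km` and the Earth radius of 6371 km are assumptions of the example, and the result is not the distance actually cycled.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


# Coordinates of the first trip shown by df.head():
# Mission Dolores Park -> Webster St at O'Farrell St
distance = haversine_km(37.7614, -122.426, 37.7835, -122.431)
print(f"{distance:.2f} km")
```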

<a id='clean'></a>
## Part III - Cleaning Data

In [None]:
# Create copies of original DataFrames
df_clean = df.copy()

**Define**
<br>Set appropriate data types for fields mentioned in the [Quality issues](#issues)

**Code**

In [None]:
# set dates to timestamps
df_clean.start_time = pd.to_datetime(df_clean.start_time)
df_clean.end_time = pd.to_datetime(df_clean.end_time)

In [None]:
# set user type, gender and bike_share_for_all_trip to category
df_clean.user_type = df_clean.user_type.astype('category')
df_clean.member_gender = df_clean.member_gender.astype('category')
df_clean.bike_share_for_all_trip = df_clean.bike_share_for_all_trip.astype('category')

In [None]:
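# set bike id, start_station_id, end_station_id to object
# (each column is converted from itself so the station ids are not overwritten with bike ids)
df_clean.bike_id = df_clean.bike_id.astype(str)
df_clean.start_station_id = df_clean.start_station_id.astype(str)
df_clean.end_station_id = df_clean.end_station_id.astype(str)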

**Test**

In [None]:
df_clean.info(verbose=True, show_counts=True)

**Define**
<br>Calculate the age of the member

**Code**

In [None]:
# subtract the birth year from the current year
df_clean['member_age'] = 2019-df_clean['member_birth_year']

**Test**

In [None]:
df_clean.head(20)

**Define**
<br>Enhance dataset with new date related fields

**Code**

In [None]:
# extract start time month name
df_clean['start_time_month_name']=df_clean['start_time'].dt.strftime('%B')

In [None]:
# extract start time month number
df_clean['start_time_month']=df_clean['start_time'].dt.month.astype(int)

In [None]:
# extract start time weekdays
df_clean['start_time_weekday']=df_clean['start_time'].dt.strftime('%a')

In [None]:
# extract start time day
df_clean['start_time_day']=df_clean['start_time'].dt.day.astype(int)

In [None]:
# extract start time hour
df_clean['start_time_hour']=df_clean['start_time'].dt.hour

**Test**

In [None]:
df_clean.head()

In [None]:
df_clean.info(verbose=True, show_counts=True)

In [None]:
# code for the age boxplot

plt.figure(figsize = [10, 4])
base_color = sns.color_palette()[0]

sns.boxplot(data=df_clean, x='member_age', color=base_color);

In [None]:
df_clean.member_age.mean()

In [None]:
df_clean.member_age.describe(percentiles = [ .95])

**Define**
<br>Remove age outliers. As mentioned in the [Quality issues](#issues), some members have a birth year before 1900, which implies an age above 100. Since 95% of the users are younger than 58, I am going to keep users aged 60 or younger.
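
As a rough illustration of this reasoning, the sketch below computes a 95th percentile and applies the cut-off at 60 on a small made-up series of ages (not the real data). It also shows why missing ages disappear in the same step.

```python
import numpy as np
import pandas as pd

# Made-up ages: 40 plausible riders (20..59), one impossible value, one missing.
ages = pd.Series(list(range(20, 60)) + [119, np.nan], name="member_age")

p95 = ages.quantile(0.95)   # missing values are skipped by default
kept = ages[ages <= 60]     # NaN <= 60 is False, so missing ages drop out too

print(f"95th percentile: {p95:.1f}")
print(f"kept {len(kept)} of {len(ages)} rows")
```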

**Code**

In [None]:
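# Keep riders aged 60 or younger; rows with a missing age fail the comparison and are dropped too
# .copy() returns an independent frame, which avoids a possible SettingWithCopyWarning in the next cell
df_clean = df_clean.query('member_age <=60').copy()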

In [None]:
# change age and birth year to integer
df_clean.member_age = df_clean.member_age.astype(int)
df_clean.member_birth_year = df_clean.member_birth_year.astype(int)

**Test**

In [None]:
df_clean.describe()

In [None]:
df_clean.info(verbose=True, show_counts=True)

In [None]:
# save cleaned data 
df_clean.to_csv('clean_master_file.csv', index=False)

### What is the structure of your dataset?

Originally there were approx. 1,850,000 bike rides that happened in 2018 in the San Francisco Bay Area. The dataset contained features about:
 * trip duration: start/end time, how long the trip took in seconds
 * stations: start/end station, name, geolocation (latitude/longitude)
 * anonymized customer data: gender, birth year and user type
 * rented bikes: bike id

The dataset was further enhanced with features that I may find necessary to perform interesting analysis:
 * rental time: month, day, hour of the day, weekday (all derived from the start time)
 * customer: age

### What is/are the main feature(s) of interest in your dataset?

I'm most interested in figuring out when (time of day, weekday, month) and where bikes are in high demand. I also want to know which age range and gender use the service the most, and whether the service is used mostly by members or casual riders.

### What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect that the start time will be most exploited in my analysis as well as customer related data. I expect that location and datetime will have the strongest effect on bike demand.

<a id='univariate'></a>
## Part IV - Univariate Exploration

I'll start by looking at the monthly trend of number of bike rentals and distribution of weekdays and hours of the day. I will also explore the duration of the trips.

In [None]:
# monthly usage of the bike sharing system
g = sns.catplot(data=df_clean, x='start_time_month', kind='count', color = base_color)
g.set_axis_labels("Month", "#Bike Trips")
g.fig.suptitle('Monthly usage of the bike share system', y=1.03, fontsize=14, fontweight='semibold');

Winter months are the worst for the bike sharing system, most probably due to the weather conditions. Demand is highest between May and October, reaching its peak in October, followed by July.

In [None]:
# weekday usage of the bike sharing system

weekday = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
g = sns.catplot(data=df_clean, x='start_time_weekday', kind='count', color = base_color, order = weekday)
g.set_axis_labels("Weekdays", "#Bike Trips")
g.fig.suptitle('Weekly usage of the bike share system', y=1.03, fontsize=14, fontweight='semibold');

The bike share system is mainly used during weekdays, with Tuesday - Thursday as the most popular days for bike rides. The system is most probably used as a daily work/school commute.

In [None]:
# hourly usage of the bike sharing system

g = sns.catplot(data=df_clean, x='start_time_hour', kind='count', color = base_color)
g.set_axis_labels("Hours", "#Bike Trips")
g.fig.suptitle('Hourly usage of the bike share system', y=1.03, fontsize=14, fontweight='semibold');

The hourly distribution is bimodal: the system is used mainly around 8-9am and 5-6pm, when people get to and get back from work.
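
The two peaks can also be read off numerically rather than from the chart. The sketch below counts trips per start hour on a handful of made-up timestamps; in the notebook the input would be the `start_time` column of `df_clean`.

```python
import pandas as pd

# Made-up start times standing in for df_clean['start_time'].
start_time = pd.to_datetime(pd.Series([
    "2018-05-10 08:26:21", "2018-05-10 08:41:02", "2018-05-10 08:55:40",
    "2018-05-10 12:10:00",
    "2018-05-10 17:05:13", "2018-05-10 17:30:45",
    "2018-05-10 22:15:00",
]))

# Number of trips that start in each hour of the day, ordered by hour.
trips_per_hour = start_time.dt.hour.value_counts().sort_index()

# The two busiest hours correspond to the two modes of the distribution.
busiest_two = trips_per_hour.nlargest(2).index.tolist()

print(trips_per_hour)
print("busiest hours:", busiest_two)
```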

In [None]:
# code for the (histogram) duration (sec) distribution

bin_edges = np.arange(0, 3600,60)

plt.hist(data = df_clean, x = 'duration_sec', bins = bin_edges)

plt.title("Trip duration (sec) histogram", y=1.03, fontsize=14, fontweight='semibold')
plt.xlabel('Duration (sec)')
plt.ylabel('#Bike Trips');

Looking at the histogram, we can see that most trips are no longer than 30 min (1800 sec) and usually last 6 to 15 min. This can be explained by two facts:
1. The way the system is priced: single trips and the 24h or 72h access passes include trips of up to 30 min at no additional charge; beyond that you pay an extra $3 for each additional 15 min. Only the monthly pass includes 45-min rides at no additional charge.
2. The way the system is used: it looks like people use the system for commuting, so their trips are usually short, probably because their homes are close to their workplace/school.
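
To make the pricing rule concrete, here is a small worked sketch of the extra charge as a function of trip duration. The function name `overage_fee` is hypothetical, and the rounding rule (every started 15-minute block is billed in full) is an assumption of this example rather than something stated in the data.

```python
import math


def overage_fee(duration_sec, included_min=30, block_min=15, fee_per_block=3):
    """Extra charge in dollars for a single trip of `duration_sec` seconds."""
    extra_sec = duration_sec - included_min * 60
    if extra_sec <= 0:
        return 0
    # Assumption: every started block beyond the included time is billed in full.
    blocks = math.ceil(extra_sec / (block_min * 60))
    return blocks * fee_per_block


# Columns: duration, fee with 30 included minutes, fee with 45 included minutes.
for seconds in (1060, 1800, 2400, 3600):
    print(seconds, overage_fee(seconds), overage_fee(seconds, included_min=45))
```

Under these assumptions a 40-minute trip (2400 sec) costs an extra $3 on a single trip or access pass and nothing on the monthly pass, which is consistent with riders keeping trips under the included time.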

### Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

> There were unusual points for the duration (sec), which in some cases lasted more than 24h. For the histogram I set the max range to 3600 sec = 60 min.

### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

> There was one unusual distribution for the member birth year, which in some cases was dated before 1900. Since 95% of the members are between 17 and 57 years old, I removed users older than 60.


<a id='bivariate'></a>
## Part V - Bivariate Exploration

In this section I will further explore the dataset by adding the customer type to the analysis.

In [None]:
# calculating % split for the user type
customer = df_clean.query('user_type == "Customer"')['bike_id'].count()
subscriber = df_clean.query('user_type == "Subscriber"')['bike_id'].count()

customer_proportion = customer / df_clean['bike_id'].count()
subscriber_proportion = subscriber / df_clean['bike_id'].count()
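
As a side note, the same split can be obtained in a single call with `value_counts(normalize=True)`. The sketch below uses a made-up frame whose proportions were chosen for readability; it is not the real dataset.

```python
import pandas as pd

# Made-up trips: 22 by subscribers, 3 by customers.
trips = pd.DataFrame({"user_type": ["Subscriber"] * 22 + ["Customer"] * 3})

# Share of trips per user type, as fractions that sum to 1.
shares = trips["user_type"].value_counts(normalize=True)
print((shares * 100).round(1))
```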

In [None]:
plt.figure(figsize = [10, 5])

# code for the bar chart
plt.subplot(1, 2, 1)

g = sns.countplot(data=df_clean, x="user_type", order=df_clean.user_type.value_counts().index)
g.set_xlabel('User Type')
g.set_ylabel('#Bike Trips')

# code for the pie chart
plt.subplot(1, 2, 2)

labels = ['Customer', 'Subscriber']
sizes = [customer_proportion, subscriber_proportion]
colors = ['darkorange', 'steelblue']
explode = (0, 0.1)

plt.pie(sizes, explode=explode, labels=labels, colors = colors,
        autopct='%1.1f%%', shadow=True, startangle=90)
plt.axis('equal')

plt.suptitle('User type split for GoBike sharing system', y=1.03, fontsize=14, fontweight='semibold');

The bike sharing system is used mainly by subscribers (88%) rather than occasional riders (12%).
<br>
<br>Next I am going to explore the renting trends per each user type.

In [None]:
# monthly usage of the bike sharing system per user type
g = sns.catplot(data=df_clean, x='start_time_month', col="user_type", kind='count', sharey = False,
            color = base_color)
g.set_axis_labels("Month", "#Bike Trips")
g.set_titles("{col_name}")
g.fig.suptitle('Monthly usage of the bike share system per user type', y=1.03, fontsize=14, fontweight='semibold');

Winter months are the worst for the bike sharing system for both groups, which is probably explained by the harsher weather.
<br>
<br>For **Customers,** demand is highest around summertime, reaching its peak in July. Customers are most probably occasional riders or tourists coming to visit the Bay Area. For **Subscribers,** the highest demand is from May till October, reaching its peak in October. Subscribers are most probably regular riders using bikes for a daily commute.

In [None]:
# weekday usage of the bike sharing system per user type

weekday = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
g = sns.catplot(data=df_clean, x='start_time_weekday', col="user_type", kind='count', sharey = False,
            color = base_color, order = weekday)
g.set_axis_labels("Weekday", "#Bike Trips")
g.set_titles("{col_name}")
g.fig.suptitle('Weekly usage of the bike share system per user type', y=1.03, fontsize=14, fontweight='semibold');

Customers and subscribers rent bikes on different days. As mentioned above, **customers** are most probably occasional riders and tourists who use the bike sharing system on holiday or weekend trips. On the other hand, **subscribers** are most probably daily work/school commuters who use the system during the work week.
<br>
<br>Next, I am going to check when within a day bikes are high in demand.

In [None]:
# hourly usage of the bike sharing system per user type

g = sns.catplot(data=df_clean, x='start_time_hour', col="user_type", kind='count', sharey = False,
            color = base_color)
g.set_axis_labels("Hour", "#Bike Trips")
g.set_titles("{col_name}")
g.fig.suptitle('Hourly usage of the bike share system per user type', y=1.03, fontsize=14, fontweight='semibold');

The time of day when bikes are rented most often also differs. **Customers** use bikes mainly between 8am and 7pm, with the renting peak around 5pm. **Subscribers**, on the other hand, use the system mainly around 8-9am and 5-6pm, when they go to and come back from work.
<br>
<br>Next, I am going to check how the trip duration varies between customers and subscribers.

In [None]:
# code for the (histogram) duration (sec) distribution per user type

g = sns.FacetGrid(df_clean, col="user_type", margin_titles=True, height=5)  # 'size' was renamed to 'height' in seaborn 0.9
bin_edges = np.arange(0, 3600,60)
g.map(plt.hist, "duration_sec", color=base_color, bins=bin_edges)
g.set_axis_labels("Duration (sec)", "#Bike Trips")
g.set_titles(col_template = '{col_name}')
g.fig.suptitle('Trip duration (sec) histogram per user type', y=1.03, fontsize=14, fontweight='semibold');

In [None]:
# code for the (boxplot) duration (sec) distribution per user type

data = df_clean.query('duration_sec < 3600')
g = sns.catplot(data=data, y='duration_sec', col="user_type", kind='box',
            color = base_color)
g.set_titles(col_template = '{col_name}')
g.set_axis_labels("", "Trip duration (sec)")
g.fig.suptitle('Trip duration (sec) boxplot per user type', y=1.03, fontsize=14, fontweight='semibold');

Looking at both charts (histograms and box plots), we can see that trip durations are longer for customers (9 to 23 minutes) than for subscribers (7 to 13 minutes). This can probably be explained by the fact that subscribers are mainly commuters who take short trips to work/school rather than longer trips around the Bay Area.
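
The ranges quoted above correspond to the box edges, that is, the 25th and 75th percentiles. A minimal sketch of how such quartiles could be computed per user type is shown below, on made-up durations rather than the real data.

```python
import pandas as pd

# Made-up trip durations in seconds.
trips = pd.DataFrame({
    "user_type": ["Customer"] * 5 + ["Subscriber"] * 5,
    "duration_sec": [540, 800, 1000, 1380, 2400, 300, 420, 600, 780, 1500],
})

quartiles_min = (
    trips.groupby("user_type")["duration_sec"]
         .quantile([0.25, 0.5, 0.75])
         .unstack()      # one row per user type, one column per quantile
         .div(60)        # seconds -> minutes
)
print(quartiles_min.round(1))
```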

### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

> Adding the user type to the analysis revealed different usage behaviours between customers and subscribers. As mentioned above, **customers** are casual riders, most probably tourists, who rent bikes mainly in summertime (with the peak in July), more often on weekends than on weekdays, and throughout the day rather than only around commute hours (8-9am and 5-6pm). **Subscribers** are daily commuters who also use the system mostly in the warmer months, May-October (with the peak in October). They rent bikes more often on weekdays than on weekends, mainly around the time they go to and come back from work or school (8-9am and 5-6pm).

### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

> There is a difference in the trip duration between customers and subscribers. **Customers'** trips are usually longer than subscribers', most probably because they prefer bike rides on weekends in summertime, which encourages longer trips around the area. **Subscribers**, on the other hand, use the system mainly for commuting, so they prefer quick rides to and from work/school.

<a id='multivariate'></a>
## Part VI - Multivariate Exploration

In this section I will further explore the dataset by adding gender to the customer type and check the hourly distribution of bike rides during weekdays for customers and subscribers.

In [None]:
plt.figure(figsize = [10, 5])

# code for the bar chart
plt.subplot(1, 2, 1)

g = sns.countplot(data=df_clean, x="user_type", hue="member_gender", order=df_clean.user_type.value_counts().index)
g.set_xlabel('User Type')
g.set_ylabel('#Bike Trips');

In general, males use the system more often than females and others (the registration system allows you to choose 'Other' as a gender). However, the male-to-female ratio is much smaller for **customers** (more or less 2:1) than for **subscribers** (3:1).
<br>
<br>Let's explore if gender affects the way the bike system is used within a year, weekdays and hours of the day.
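
The male-to-female ratios mentioned above were read from the chart. A sketch of how they could be computed directly with `pd.crosstab` follows, using made-up rows chosen so that the ratios are easy to verify by hand.

```python
import pandas as pd

# Made-up trips: customers 4 male / 2 female, subscribers 6 male / 2 female.
trips = pd.DataFrame({
    "user_type": ["Customer"] * 6 + ["Subscriber"] * 8,
    "member_gender": ["Male"] * 4 + ["Female"] * 2 + ["Male"] * 6 + ["Female"] * 2,
})

# Rows: user type, columns: gender, values: number of trips.
counts = pd.crosstab(trips["user_type"], trips["member_gender"])
male_to_female = counts["Male"] / counts["Female"]
print(male_to_female)
```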

In [None]:
# monthly usage of the bike sharing system per user type and gender

g = sns.catplot(data=df_clean, x='start_time_month', col="user_type", hue="member_gender", kind='count', sharey = False)
g.set_axis_labels("Month", "#Bike Trips")
g._legend.set_title('Gender')
g.set_titles("{col_name}")
g.fig.suptitle('Monthly usage of the bike share system per user type and gender', y=1.03, fontsize=14, fontweight='semibold');

The trend is very similar for males and females: for **customers,** the highest demand is around summertime, reaching its peak in July; for **subscribers,** the highest demand is from May till October, reaching its peak in October. Surprisingly, for **customers** there are quite a lot of females using the system between January and March in comparison to males: the male-to-female ratio is much smaller than for the rest of the year.

In [None]:
# weekday usage of the bike sharing system per user type and gender

weekday = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
g = sns.catplot(data=df_clean, x='start_time_weekday', col="user_type", hue='member_gender', kind='count', sharey = False,
                order = weekday)
g.set_axis_labels("Weekday", "#Bike Trips")
g._legend.set_title('Gender')
g.set_titles("{col_name}")
g.fig.suptitle('Weekly usage of the bike share system per user type and gender', y=1.03, fontsize=14, fontweight='semibold');

As in the previous section, the trend is very similar for males and females: **customers** use the system more often on weekends than during the work week (although the jump in bike use on weekends is much higher for females than for males); **subscribers** use the system mainly during the work week.

In [None]:
# hourly usage of the bike sharing system per user type and gender

g = sns.catplot(data=df_clean, x='start_time_hour', col="user_type", hue='member_gender', kind='count', sharey = False)
g.set_axis_labels("Hour", "#Bike Trips")
g._legend.set_title('Gender')
g.set_titles("{col_name}")
g.fig.suptitle('Hourly usage of the bike share system per user type and gender', y=1.03, fontsize=14, fontweight='semibold');

During the day, both males and females use the system the same way: **customers** use bikes mainly between 8am and 7pm; **subscribers**, on the other hand, use the system mainly around 8-9am and 5-6pm, when they go to and come back from work.

In [None]:
# code for the (violinplot) duration (sec) distribution per user type and gender

g = sns.catplot(data=data, x='user_type', y="duration_sec", hue="member_gender", kind="violin")

g.set_axis_labels("User Type", "Trip duration (sec)")
g._legend.set_title('Gender')
g.fig.suptitle('Trip duration per user type and gender', y=1.03, fontsize=14, fontweight='semibold');

Here we can observe that in both cases, females take longer trips (measured in time) than males and other. The difference is more visible for **customers** (~13 min for males and other vs ~15 for females) than for **subscribers** (the difference is quite small).

In [None]:
# Setting the weekday order
df_clean['start_time_weekday'] = pd.Categorical(df_clean['start_time_weekday'], 
                                                categories=['Mon','Tue','Wed','Thu','Fri','Sat', 'Sun'], 
                                                ordered=True)
plt.figure(figsize=(9,8))
plt.suptitle('Hourly usage during the weekday for customers and subscribers', fontsize=14, fontweight='semibold')

# heatmap for customers
plt.subplot(1, 2, 1)
df_customer = df_clean.query('user_type == "Customer"').groupby(["start_time_hour", "start_time_weekday"])["bike_id"].size().reset_index()
df_customer = df_customer.pivot(index="start_time_hour", columns="start_time_weekday", values="bike_id")  # keyword arguments are required in pandas 2.0+
sns.heatmap(df_customer, cmap="BuPu")

plt.title("Customer", y=1.015)
plt.xlabel('Weekday')
plt.ylabel('Start Time Hour')

# heatmap for subscribers
plt.subplot(1, 2, 2)
df_subscriber = df_clean.query('user_type == "Subscriber"').groupby(["start_time_hour", "start_time_weekday"])["bike_id"].size().reset_index()
df_subscriber = df_subscriber.pivot(index="start_time_hour", columns="start_time_weekday", values="bike_id")  # keyword arguments are required in pandas 2.0+
sns.heatmap(df_subscriber, cmap="BuPu")

plt.title("Subscriber", y=1.015)
plt.xlabel('Weekday')
plt.ylabel('');

The plot summarizes in one place the different trends for customers and subscribers described earlier.
<br>
#### Customers use the bike sharing system more often on weekends:
 * weekdays: most bike rides happen around 8-9am and 5-6pm with the peak on Fridays around 5pm
 * weekends: most bike rides happen between 10am - 8pm with the peak on Saturdays around 2pm

#### Subscribers use the bike sharing system mainly on weekdays:
 * weekdays: most bike rides happen around 8-9am and 5-6pm with the peak on Tuesdays around 8am
 * weekends: bikes are still rented but there is a significant drop in the number of rented bikes throughout the entire weekend
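
The peaks listed above were read from the heatmaps. The sketch below shows one way to locate the busiest weekday/hour cell numerically, on a few made-up rows. The column names match those created during cleaning, while the values are invented for the example.

```python
import pandas as pd

# Made-up trips; in the notebook these columns come from df_clean.
trips = pd.DataFrame({
    "start_time_weekday": ["Tue", "Tue", "Tue", "Tue", "Sat", "Sat", "Fri"],
    "start_time_hour":    [8, 8, 8, 17, 14, 14, 17],
})

# Hour-by-weekday grid of trip counts, the same shape a heatmap is drawn from.
grid = pd.crosstab(trips["start_time_hour"], trips["start_time_weekday"])

# unstack() flattens the grid into a Series indexed by (weekday, hour),
# so idxmax() returns the label pair of the busiest cell.
peak_day, peak_hour = grid.unstack().idxmax()
print(peak_day, peak_hour, grid.loc[peak_hour, peak_day])
```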

### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

> Plotting a heatmap of when bikes are in high demand throughout the day on each weekday shed new light on customer behaviour. Plotting #bike trips throughout the day and #bike trips per weekday separately gave the impression that customer demand is quite high throughout the day with a peak around 5pm, which is not entirely true. On weekdays, the trend for customers follows (although customers are not early birds) the one for subscribers, who rent bikes mainly around commute hours (8-9am and 5-6pm). As shown in the bivariate exploration, most customer trips happen on weekends; the heatmap adds that they happen mainly between 10am and 8pm with the peak on Saturdays around 2pm, which was previously not visible.

### Were there any interesting or surprising interactions between features?

> I have also checked whether the trends differ between genders within each user group. The trends do not differ much, but surprisingly there are quite a lot of females using the system between January and March in comparison to males: the ratio (male:female) is much smaller than for the rest of the year. Moreover, females take longer trips (measured in time) than males and others.