# Exploratory Data Analysis of a Bike-Sharing Company's Data

## Introduction

This notebook presents an exploratory data analysis (EDA) of data from a bike-sharing company. The dataset contains 17,379 hourly observations with 17 features, derived from the [Capital Bikeshare data](https://s3.amazonaws.com/capitalbikeshare-data/index.html).

The aim of this analysis is to understand the data and gain insight into the bike-sharing company's business. For instance, if you are newly hired into a leadership position at this company, you will be interested in goals such as customer satisfaction, employee morale, brand recognition, market share, cost reduction, and revenue growth. This EDA looks for patterns and meaningful observations that can support those goals and inform important decisions, which requires a deep dive into the data.

## Dataset Description

Here is a brief description of the features present in the dataset:

- `Feature 1`: [Description of feature 1]
- `Feature 2`: [Description of feature 2]
- `Feature 3`: [Description of feature 3]
- ...

## EDA Outline

1. **Data Cleaning**: Check for missing data and handle it appropriately. Also look for inconsistencies in the data and resolve them.
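As a rough illustration of the data-cleaning step, the sketch below counts missing values per column and flags rows that fail a simple consistency rule. The toy frame and its values are made up, and the rule that `count` should equal `casual + registered` is an assumption about how the columns relate, not something documented here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the hourly data; all values are made up for illustration.
toy = pd.DataFrame({
    "hr": [0, 1, 2, 3],
    "casual": [3, 8, np.nan, 3],
    "registered": [13, 32, 27, 10],
    "count": [16, 40, 32, 14],  # the last row is deliberately inconsistent
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Assumed consistency rule: count == casual + registered.
# Rows with missing inputs are set aside before the comparison.
complete = toy.dropna(subset=["casual", "registered", "count"])
inconsistent = complete[
    complete["casual"] + complete["registered"] != complete["count"]
]

print(missing_per_column)
print(inconsistent)
```

How to handle the flagged rows (drop, impute, or correct) depends on the analysis and is left open here.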

## Credit

1. **Book**: *Dive Into Data Science* by Bradford Tuckfield

The original source of the bike-sharing data is [Capital Bikeshare](https://ride.capitalbikeshare.com/system-data). The data was compiled and augmented by Hadi Fanaee-T and Joao Gama and posted online by Mark Kaghazgarian.


In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Dataset inspection

In [41]:
df = hour = pd.read_csv("../data/hours.csv")
df.head(24)

,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,count
0,1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16
1,2,2011-01-01,1,0,1,1,0,6,0,1,0.22,0.2727,0.8,0.0,8,32,40
2,3,2011-01-01,1,0,1,2,0,6,0,1,0.22,0.2727,0.8,0.0,5,27,32
3,4,2011-01-01,1,0,1,3,0,6,0,1,0.24,0.2879,0.75,0.0,3,10,13
4,5,2011-01-01,1,0,1,4,0,6,0,1,0.24,0.2879,0.75,0.0,0,1,1
5,6,2011-01-01,1,0,1,5,0,6,0,2,0.24,0.2576,0.75,0.0896,0,1,1
6,7,2011-01-01,1,0,1,6,0,6,0,1,0.22,0.2727,0.8,0.0,2,0,2
7,8,2011-01-01,1,0,1,7,0,6,0,1,0.2,0.2576,0.86,0.0,1,2,3
8,9,2011-01-01,1,0,1,8,0,6,0,1,0.24,0.2879,0.75,0.0,1,7,8
9,10,2011-01-01,1,0,1,9,0,6,0,1,0.32,0.3485,0.76,0.0,8,6,14


In [12]:
df.shape

(17379, 17)

In [13]:
df.columns

Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'hr', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed',
       'casual', 'registered', 'count'],
      dtype='object')

## Summary Statistics

The summary statistics below give a rough measure of the size of the business over the two years covered by the data.

In [31]:
hour.describe()


,instant,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,count
count,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0,17379.0
mean,8690.0,2.50164,0.502561,6.537775,11.546752,0.02877,3.003683,0.682721,1.425283,0.496987,0.475775,0.627229,0.190098,35.676218,153.786869,189.463088
std,5017.0295,1.106918,0.500008,3.438776,6.914405,0.167165,2.005771,0.465431,0.639357,0.192556,0.17185,0.19293,0.12234,49.30503,151.357286,181.387599
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.0,0.0,0.0,0.0,1.0
25%,4345.5,2.0,0.0,4.0,6.0,0.0,1.0,0.0,1.0,0.34,0.3333,0.48,0.1045,4.0,34.0,40.0
50%,8690.0,3.0,1.0,7.0,12.0,0.0,3.0,1.0,1.0,0.5,0.4848,0.63,0.194,17.0,115.0,142.0
75%,13034.5,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.66,0.6212,0.78,0.2537,48.0,220.0,281.0
max,17379.0,4.0,1.0,12.0,23.0,1.0,6.0,1.0,4.0,1.0,1.0,1.0,0.8507,367.0,886.0,977.0


In addition to the mean above, we look at some of the other metrics.

In [37]:

desc = hour.describe()
print(f"25% of the hours in the dataset have {desc.loc['25%', 'count']} or fewer riders.")
print(f"75% of the hours in the dataset have {desc.loc['75%', 'count']} or fewer riders.")
print(f"The minimum number of registered riders in an hour is: {hour['registered'].min()}")
print(f"The maximum number of registered riders in an hour is: {hour['registered'].max()}")

25% of the hours in the dataset have 40.0 or fewer riders.
75% of the hours in the dataset have 281.0 or fewer riders.
The minimum number of registered riders in an hour is: 0
The maximum number of registered riders in an hour is: 886
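Quartiles like these can also be computed directly, without going through `describe()`. The sketch below uses `Series.quantile` on a small made-up series (not the real data) and derives the interquartile range as one simple measure of spread.

```python
import pandas as pd

# Made-up hourly totals, used only to illustrate the call.
toy_counts = pd.Series([1, 16, 40, 142, 281, 977], name="count")

# quantile() accepts a list of probabilities and returns a Series
# indexed by those probabilities (linear interpolation by default).
quartiles = toy_counts.quantile([0.25, 0.5, 0.75])

# Interquartile range: the width of the middle 50% of the values.
iqr = quartiles.loc[0.75] - quartiles.loc[0.25]

print(quartiles)
print(f"Interquartile range: {iqr}")
```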


## Observations so far
1. These metrics are worth understanding. The number of registered riders per hour ranges from 0 to 886; the upper figure is the hourly record you will need to break if you want the business to do better than before.
2. Ridership is much lower at night and picks up during the day.
3. The hourly rider count varies widely: 25 percent of hours have 40 or fewer riders, while the busiest hour had 977 riders in total (the record for registered riders is 886). As a leader, you will want more hours close to that peak and fewer hours at the low end.
4. You could lower ride prices at night to attract more riders.
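One way to check the day-versus-night observation is to average the rider count for each hour of the day. The sketch below does this on a small made-up frame; the column names `hr` and `count` match the dataset, but the values are illustrative only.

```python
import pandas as pd

# Made-up rows: two days, four sampled hours each.
toy = pd.DataFrame({
    "hr":    [0, 3, 8, 17, 0, 3, 8, 17],
    "count": [16, 4, 200, 400, 20, 6, 240, 440],
})

# Mean rider count for each hour of the day.
hourly_mean = toy.groupby("hr")["count"].mean()

# Hours with the highest and lowest average ridership.
busiest_hour = hourly_mean.idxmax()
quietest_hour = hourly_mean.idxmin()

print(hourly_mean)
print(f"Busiest hour: {busiest_hour}, quietest hour: {quietest_hour}")
```

Run against the real `hour` frame, the same `groupby` would give a 24-row profile that could be plotted to see where ridership rises and falls.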

## Night time data
#### Assumptions
1. Assume that night time runs from midnight up to, but not including, 5 AM (`hr` values 0 through 4).

In [52]:
# Compare mean registered riders at night and during the day.
# Night: hr 0-4 (before 5 AM). Day: hr 6-17 (after 5 AM and before 6 PM).
nighttime_registered_mean = hour.loc[hour["hr"] < 5, "registered"].mean()
daytime_registered_mean = hour.loc[
    (hour["hr"] > 5) & (hour["hr"] < 18), "registered"
].mean()
print(f"The average of all night time registered riders is {round(nighttime_registered_mean, 2)}.")
print(f"The average of all day time registered riders is {round(daytime_registered_mean, 2)}.")

The average of all night time registered riders is 20.79.
The average of all day time registered riders is 200.75.


The gap between night and day is large, roughly a factor of ten. Even though roads are less crowded at night, there is room to improve night-time ridership.
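The same comparison can be written by labeling each row with a period and grouping on that label. The sketch below is one possible approach on made-up values: it reuses the night (`hr < 5`) and day (`5 < hr < 18`) definitions from this notebook, and the `period` column and the `"other"` label for the remaining hours are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows covering night, day, and the hours that fall in neither.
toy = pd.DataFrame({
    "hr":         [1, 4, 5, 9, 17, 18, 22],
    "registered": [10, 2, 6, 180, 400, 350, 60],
})

# Same boundaries as the cell above: night is hr < 5, day is 5 < hr < 18.
conditions = [
    toy["hr"] < 5,
    (toy["hr"] > 5) & (toy["hr"] < 18),
]
toy["period"] = np.select(conditions, ["night", "day"], default="other")

# Mean registered riders per period, and the day-to-night ratio.
period_mean = toy.groupby("period")["registered"].mean()
day_to_night = period_mean["day"] / period_mean["night"]

print(period_mean)
print(f"Day-to-night ratio: {day_to_night:.2f}")
```

Labeling makes it visible that hour 5 and the evening hours belong to neither group under these definitions, which is worth keeping in mind when reading the two averages.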

## Seasonal Data
In this dataset, the `season` variable encodes the four seasons of the year:
1. Winter: 1
2. Spring: 2
3. Summer: 3
4. Fall (autumn): 4

Find the average number of riders in each season.

In [58]:
hour.groupby(["season"])["count"].mean()

season
1    111.114569
2    208.344069
3    236.016237
4    198.868856
Name: count, dtype: float64
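Numeric season codes are easy to misread, so it can help to map them to names before grouping. The sketch below uses the code-to-name mapping listed in this section on a small made-up frame; `season_names` and `season_name` are hypothetical names introduced here.

```python
import pandas as pd

# Mapping taken from the list in this section.
season_names = {1: "winter", 2: "spring", 3: "summer", 4: "fall"}

# Made-up rows for illustration.
toy = pd.DataFrame({
    "season": [1, 1, 2, 3, 3, 4],
    "count":  [100, 120, 210, 230, 250, 200],
})

# Add a readable label, average per season, and sort from busiest to quietest.
seasonal_mean = (
    toy.assign(season_name=toy["season"].map(season_names))
       .groupby("season_name")["count"]
       .mean()
       .sort_values(ascending=False)
)

print(seasonal_mean)
```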

## Observation

There is a definite pattern in the above output. Ridership is lowest in winter (about 111 riders per hour) and highest in summer (about 236), with spring and fall in between.

Next, add the `holiday` variable to the grouping to compare holidays and non-holidays within each season.

In [None]:
hour.groupby(["season", "holiday"])["count"].mean()
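Grouping on two keys returns a Series with a two-level index, which can be hard to scan. A minimal sketch, on made-up values, of pivoting the `holiday` level into columns with `unstack` so that each season becomes one row:

```python
import pandas as pd

# Made-up rows for illustration; holiday is 1 on holidays and 0 otherwise.
toy = pd.DataFrame({
    "season":  [1, 1, 1, 2, 2, 2],
    "holiday": [0, 0, 1, 0, 1, 1],
    "count":   [100, 120, 60, 200, 150, 170],
})

# Mean count per (season, holiday) pair, with holiday values as columns.
by_season_holiday = (
    toy.groupby(["season", "holiday"])["count"]
       .mean()
       .unstack("holiday")
)

print(by_season_holiday)
```

In the resulting table, the column labeled 0 holds the non-holiday averages and the column labeled 1 the holiday averages.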