# **Bike Sharing Analysis**

Dataset: Bike Sharing Dataset ([Source](https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset))

# **1. Library Import**

*Library* [`pandas`](https://pandas.pydata.org) for data processing, analysis, and manipulation.

*Library* [`matplotlib`](https://matplotlib.org/) for visualization.

*Library* [`seaborn`](https://seaborn.pydata.org/) for drawing attractive and informative statistical graphics.

*Library* [`zipfile`](https://docs.python.org/3/library/zipfile.html) (class `ZipFile`) for extracting the zip archive.

*Library* [`numpy`](https://numpy.org/) for numerical operations such as the percentile calculations used in the outlier check.

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from zipfile import ZipFile
import numpy as np

# **2. Data Wrangling**

## *2.1 Gathering Data*

In [5]:
with ZipFile("Bike-sharing-dataset.zip", 'r') as zipObj:
    zipObj.extractall()

In [41]:
data_day = pd.read_csv('day.csv')
data_hour = pd.read_csv('hour.csv')
data_day, data_hour

(     instant      dteday  season  yr  mnth  holiday  weekday  workingday  \
 0          1  2011-01-01       1   0     1        0        6           0   
 1          2  2011-01-02       1   0     1        0        0           0   
 2          3  2011-01-03       1   0     1        0        1           1   
 3          4  2011-01-04       1   0     1        0        2           1   
 4          5  2011-01-05       1   0     1        0        3           1   
 ..       ...         ...     ...  ..   ...      ...      ...         ...   
 726      727  2012-12-27       1   1    12        0        4           1   
 727      728  2012-12-28       1   1    12        0        5           1   
 728      729  2012-12-29       1   1    12        0        6           0   
 729      730  2012-12-30       1   1    12        0        0           0   
 730      731  2012-12-31       1   1    12        0        1           1   
 
      weathersit      temp     atemp       hum  windspeed  casual  registe

Extracting the zip file yields two data files, day.csv and hour.csv. day.csv has 731 rows and 16 columns; hour.csv has 17379 rows and 17 columns. Attribute information is available in the [Readme.txt](Readme.txt) file.
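As an aside, the CSV members of an archive can also be listed and read without extracting them to disk. The sketch below is only an illustration: the helper name `read_csvs_from_zip` is hypothetical, and the tiny in-memory archive with made-up values stands in for `Bike-sharing-dataset.zip`.

```python
import io
from zipfile import ZipFile

import pandas as pd


def read_csvs_from_zip(archive):
    """Return {member name: DataFrame} for every .csv member of a zip archive."""
    frames = {}
    with ZipFile(archive, "r") as zip_obj:
        for name in zip_obj.namelist():
            if name.lower().endswith(".csv"):
                # ZipFile.open returns a binary file-like object pandas can read.
                with zip_obj.open(name) as handle:
                    frames[name] = pd.read_csv(handle)
    return frames


# Tiny in-memory archive with made-up values (not the real dataset).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zip_obj:
    zip_obj.writestr("day.csv", "instant,cnt\n1,10\n2,20\n")
    zip_obj.writestr("Readme.txt", "attribute notes")
buffer.seek(0)

frames = read_csvs_from_zip(buffer)
for name, frame in frames.items():
    print(name, frame.shape)  # (rows, columns) of each CSV member
```

With the real archive, passing the path `"Bike-sharing-dataset.zip"` instead of the buffer should give the shapes reported above, assuming the archive contents match this notebook.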

## *2.2 Assessing Data*

In [42]:
data_day.isnull().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [43]:
data_hour.isnull().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [44]:
print(data_day.duplicated().sum())

0


In [45]:
print(data_hour.duplicated().sum())

0


In [46]:
data_hour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB

In [48]:
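def outlier_check(data):
    """Count IQR-rule outliers per numeric column and in total."""
    outlier = {}
    for column in data.select_dtypes(include=np.number).columns:
        # 'holiday' is a 0/1 flag whose quartiles are both 0, so the
        # IQR rule would flag every holiday row; skip it.
        if column == 'holiday':
            continue
        q25, q75 = np.percentile(data[column], [25, 75])
        iqr = q75 - q25
        # Values more than 1.5 * IQR outside the quartiles count as outliers.
        lower_outlier = q25 - 1.5 * iqr
        high_outlier = q75 + 1.5 * iqr
        outliers = data[(data[column] < lower_outlier) | (data[column] > high_outlier)][column]
        outlier[column] = len(outliers)
    # The total counts outlier values, so a row can be counted more than once.
    total = sum(outlier.values())
    return outlier, total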

day_outlier, total_day = outlier_check(data_day)
hour_outlier, total_hour = outlier_check(data_hour)
print(f"Total outlier in day data: {total_day}")
print(f"Total outlier in hour data: {total_hour}")
day_outlier, hour_outlier

Total outlier in day data: 59
Total outlier in hour data: 2744


({'instant': 0,
  'season': 0,
  'yr': 0,
  'mnth': 0,
  'weekday': 0,
  'workingday': 0,
  'weathersit': 0,
  'temp': 0,
  'atemp': 0,
  'hum': 2,
  'windspeed': 13,
  'casual': 44,
  'registered': 0,
  'cnt': 0},
 {'instant': 0,
  'season': 0,
  'yr': 0,
  'mnth': 0,
  'hr': 0,
  'weekday': 0,
  'workingday': 0,
  'weathersit': 3,
  'temp': 0,
  'atemp': 0,
  'hum': 22,
  'windspeed': 342,
  'casual': 1192,
  'registered': 680,
  'cnt': 505})

1. Neither day.csv nor hour.csv contains null values.
2. Neither day.csv nor hour.csv contains duplicate rows.
3. day.csv has 59 outlier values and hour.csv has 2744 outlier values, counted per column with the IQR rule.
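The IQR rule behind these counts flags any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR. A minimal sketch on a made-up array (the helper name `iqr_bounds` and the sample values are assumptions for illustration, not part of the dataset):

```python
import numpy as np


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) IQR fences for a 1-D array of numbers."""
    q25, q75 = np.percentile(values, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr


# Made-up sample: seven ordinary values and one extreme value.
sample = np.array([10, 12, 13, 14, 15, 16, 18, 60])
lower, upper = iqr_bounds(sample)

# Anything outside the fences is treated as an outlier.
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper, outliers)
```

Here Q1 = 12.75 and Q3 = 16.5, so the fences are 7.125 and 22.125, and only the value 60 falls outside them.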

## *2.3 Cleaning Data*

In [49]:
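def remove_outliers(data):
    """Drop every row that holds an IQR-rule outlier in any numeric column."""
    row_remove = set()
    for column in data.select_dtypes(include=np.number).columns:
        # Skip 'holiday' for the same reason as in outlier_check.
        if column == 'holiday':
            continue
        q25, q75 = np.percentile(data[column], [25, 75])
        iqr = q75 - q25
        lower_outlier = q25 - 1.5 * iqr
        high_outlier = q75 + 1.5 * iqr
        outlier = data[(data[column] < lower_outlier) | (data[column] > high_outlier)].index
        # A set keeps each row index once, even if several columns flag it.
        row_remove.update(outlier)
    # Bounds are always computed on the original data, not on a partly cleaned copy.
    cleaned_data = data.drop(index=list(row_remove))
    return cleaned_data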

def print_outlier_summary(data, cleaned_data, dataset_name):
    rows_before = len(data)
    rows_after = len(cleaned_data)
    rows_removed = rows_before - rows_after
    print(f"Dataset: {dataset_name}")
    print(f"Number of rows before cleanup: {rows_before}")
    print(f"Number of rows deleted (containing outliers): {rows_removed}")
    print(f"Number of rows after cleanup: {rows_after}")
    print("-" * 50)

data_day_cleaned = remove_outliers(data_day)
data_hour_cleaned = remove_outliers(data_hour)
print_outlier_summary(data_day, data_day_cleaned, "day.csv")
print_outlier_summary(data_hour, data_hour_cleaned, "hour.csv")

Dataset: day.csv
Number of rows before cleanup: 731
Number of rows deleted (containing outliers): 58
Number of rows after cleanup: 673
--------------------------------------------------
Dataset: hour.csv
Number of rows before cleanup: 17379
Number of rows deleted (containing outliers): 2162
Number of rows after cleanup: 15217
--------------------------------------------------


The initial check flagged 59 outlier values in day.csv, but cleaning removed only 58 rows, because a single row can contain outliers in more than one column and is dropped only once. Likewise, the check flagged 2744 outlier values in hour.csv, and cleaning removed 2162 rows.
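The gap between flagged values and removed rows can be reproduced on a toy frame. This is a sketch under stated assumptions: the helper `outlier_mask` is hypothetical, the numbers are made up, and it uses `DataFrame.quantile` (linear interpolation, like `np.percentile`) as a vectorized stand-in for the loop above.

```python
import numpy as np
import pandas as pd


def outlier_mask(data, k=1.5):
    """Boolean frame marking IQR-rule outliers in each numeric column."""
    numeric = data.select_dtypes(include=np.number)
    q25 = numeric.quantile(0.25)
    q75 = numeric.quantile(0.75)
    iqr = q75 - q25
    # Comparisons align the per-column bounds with the frame's columns.
    return (numeric < q25 - k * iqr) | (numeric > q75 + k * iqr)


# Made-up data: the last row is extreme in two columns at once.
toy = pd.DataFrame({
    "hum": [0.50, 0.55, 0.60, 0.65, 0.70, 0.05],
    "windspeed": [0.10, 0.12, 0.14, 0.16, 0.18, 0.90],
    "casual": [100, 110, 120, 130, 900, 125],
})

mask = outlier_mask(toy)
flagged_values = int(mask.to_numpy().sum())  # one count per outlier cell
flagged_rows = int(mask.any(axis=1).sum())   # one count per affected row
print(flagged_values, flagged_rows)
```

Three cells are flagged but only two rows are affected, which mirrors why 59 flagged values in day.csv correspond to 58 removed rows.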

# **3. Exploratory Data Analysis**

Exploratory Data Analysis is the stage of exploring data that has been cleaned to gain insight and answer analysis questions.

In [50]:
data_day = data_day_cleaned
data_hour = data_hour_cleaned
data_day, data_hour

(     instant      dteday  season  yr  mnth  holiday  weekday  workingday  \
 0          1  2011-01-01       1   0     1        0        6           0   
 1          2  2011-01-02       1   0     1        0        0           0   
 2          3  2011-01-03       1   0     1        0        1           1   
 3          4  2011-01-04       1   0     1        0        2           1   
 4          5  2011-01-05       1   0     1        0        3           1   
 ..       ...         ...     ...  ..   ...      ...      ...         ...   
 726      727  2012-12-27       1   1    12        0        4           1   
 727      728  2012-12-28       1   1    12        0        5           1   
 728      729  2012-12-29       1   1    12        0        6           0   
 729      730  2012-12-30       1   1    12        0        0           0   
 730      731  2012-12-31       1   1    12        0        1           1   
 
      weathersit      temp     atemp       hum  windspeed  casual  registe

In [51]:
data_day.info()
data_hour.info()

<class 'pandas.core.frame.DataFrame'>
Index: 673 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     673 non-null    int64  
 1   dteday      673 non-null    object 
 2   season      673 non-null    int64  
 3   yr          673 non-null    int64  
 4   mnth        673 non-null    int64  
 5   holiday     673 non-null    int64  
 6   weekday     673 non-null    int64  
 7   workingday  673 non-null    int64  
 8   weathersit  673 non-null    int64  
 9   temp        673 non-null    float64
 10  atemp       673 non-null    float64
 11  hum         673 non-null    float64
 12  windspeed   673 non-null    float64
 13  casual      673 non-null    int64  
 14  registered  673 non-null    int64  
 15  cnt         673 non-null    int64  
dtypes: float64(4), int64(11), object(1)
memory usage: 89.4+ KB
<class 'pandas.core.frame.DataFrame'>
Index: 15217 entries, 0 to 17378
Data columns (total 17 colu

The 'dteday' column is still stored as the generic object dtype (strings), so it is converted to datetime before further analysis.
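With `errors='coerce'`, any value that cannot be parsed becomes `NaT` instead of raising an error, so it is worth checking for new missing values after the conversion. A small sketch on a made-up series (the invalid string is an assumption for illustration; the real column may contain no such values):

```python
import pandas as pd

# Made-up strings: two valid dates and one value that cannot be parsed.
raw = pd.Series(["2011-01-01", "2011-01-02", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, errors="coerce")

print(pd.api.types.is_datetime64_any_dtype(parsed))  # dtype check
print(int(parsed.isna().sum()))  # number of values coerced to NaT
```

Running `data_day['dteday'].isna().sum()` after the conversion would show whether any dates were silently coerced.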

In [52]:
data_day['dteday'] = pd.to_datetime(data_day['dteday'], errors='coerce')
data_hour['dteday'] = pd.to_datetime(data_hour['dteday'], errors='coerce')
data_hour, data_day

(       instant     dteday  season  yr  mnth  hr  holiday  weekday  workingday  \
 0            1 2011-01-01       1   0     1   0        0        6           0   
 1            2 2011-01-01       1   0     1   1        0        6           0   
 2            3 2011-01-01       1   0     1   2        0        6           0   
 3            4 2011-01-01       1   0     1   3        0        6           0   
 4            5 2011-01-01       1   0     1   4        0        6           0   
 ...        ...        ...     ...  ..   ...  ..      ...      ...         ...   
 17374    17375 2012-12-31       1   1    12  19        0        1           1   
 17375    17376 2012-12-31       1   1    12  20        0        1           1   
 17376    17377 2012-12-31       1   1    12  21        0        1           1   
 17377    17378 2012-12-31       1   1    12  22        0        1           1   
 17378    17379 2012-12-31       1   1    12  23        0        1           1   
 
        weathe

In [53]:
data_day.describe(include='all')

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,673.0,673,673.0,673.0,673.0,673.0,673.0,673.0,673.0,673.0,673.0,673.0,673.0,673.0,673.0,673.0
mean,357.989599,2011-12-23 23:45:01.337295872,2.503715,0.475483,6.557207,0.026746,2.962853,0.728083,1.40416,0.48942,0.469105,0.632846,0.186662,738.934621,3628.787519,4367.72214
min,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.275833,0.022392,2.0,20.0,22.0
25%,175.0,2011-06-24 00:00:00,1.0,0.0,3.0,0.0,1.0,0.0,1.0,0.329167,0.326379,0.524583,0.134329,304.0,2482.0,3068.0
50%,349.0,2011-12-15 00:00:00,3.0,0.0,7.0,0.0,3.0,1.0,1.0,0.484167,0.47095,0.630833,0.178479,678.0,3614.0,4401.0
75%,545.0,2012-06-28 00:00:00,4.0,1.0,10.0,0.0,5.0,1.0,2.0,0.653333,0.607958,0.734583,0.230725,1031.0,4709.0,5633.0
max,731.0,2012-12-31 00:00:00,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.378108,2258.0,6946.0,8173.0
std,212.108768,,1.123505,0.49977,3.505108,0.16146,1.927276,0.445278,0.548358,0.185105,0.16461,0.140467,0.072436,523.019213,1578.680984,1863.248953


In [55]:
data_hour.describe(include='all')

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,15217.0,15217,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0,15217.0
mean,8445.624367,2011-12-22 23:07:17.431819776,2.486233,0.476112,6.518959,11.162056,0.027469,2.991786,0.714661,1.442597,0.481585,0.462662,0.64464,0.180524,23.760597,124.004666,147.765263
min,1.0,2011-01-01 00:00:00,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.02,0.0,0.08,0.0,0.0,0.0,1.0
25%,4112.0,2011-06-25 00:00:00,1.0,0.0,3.0,5.0,0.0,1.0,0.0,1.0,0.32,0.3182,0.5,0.1045,3.0,27.0,32.0
50%,8274.0,2011-12-16 00:00:00,2.0,0.0,7.0,11.0,0.0,3.0,1.0,1.0,0.48,0.4697,0.65,0.1642,13.0,99.0,118.0
75%,12754.0,2012-06-20 00:00:00,3.0,1.0,10.0,18.0,0.0,5.0,1.0,2.0,0.64,0.6061,0.81,0.2537,37.0,184.0,228.0
max,17379.0,2012-12-31 00:00:00,4.0,1.0,12.0,23.0,1.0,6.0,1.0,3.0,1.0,1.0,1.0,0.4627,114.0,499.0,594.0
std,5025.884121,,1.124521,0.499445,3.516137,7.156066,0.163452,1.950929,0.45159,0.646215,0.190894,0.170614,0.188465,0.112053,26.555663,113.596048,131.073174


In [56]:
data_day.corr()

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
instant,1.0,1.0,0.420376,0.862844,0.505732,0.017736,-0.029141,0.07136,-0.005337,0.137097,0.138974,0.014073,-0.119028,0.22928,0.656008,0.620177
dteday,1.0,1.0,0.420376,0.862844,0.505732,0.017736,-0.029141,0.07136,-0.005337,0.137097,0.138974,0.014073,-0.119028,0.22928,0.656008,0.620177
season,0.420376,0.420376,1.0,0.00215,0.832133,-0.016955,0.007967,0.027306,0.021714,0.334904,0.342369,0.186332,-0.213489,0.251775,0.412653,0.420304
yr,0.862844,0.862844,0.00215,1.0,0.002288,0.008138,-0.041888,0.080333,-0.034379,0.029705,0.028214,-0.107701,-0.024036,0.175276,0.590436,0.549461
mnth,0.505732,0.505732,0.832133,0.002288,1.0,0.020957,0.014744,0.003784,0.049116,0.221118,0.227879,0.209617,-0.195992,0.15739,0.296159,0.295108
holiday,0.017736,0.017736,-0.016955,0.008138,0.020957,1.0,-0.097227,-0.271261,-0.038235,-0.063366,-0.067519,-0.022004,0.036498,0.015405,-0.119956,-0.097312
weekday,-0.029141,-0.029141,0.007967,-0.041888,0.014744,-0.097227,1.0,0.071445,0.039572,-0.005307,-0.01387,-0.029672,-0.000702,0.016215,0.034198,0.033527
workingday,0.07136,0.07136,0.027306,0.080333,0.003784,-0.271261,0.071445,1.0,0.036331,0.119045,0.117871,0.008533,-0.016905,-0.413574,0.376946,0.203285
weathersit,-0.005337,-0.005337,0.021714,-0.034379,0.049116,-0.038235,0.039572,0.036331,1.0,-0.120212,-0.122206,0.627726,0.075233,-0.289264,-0.26124,-0.302539
temp,0.137097,0.137097,0.334904,0.029705,0.221118,-0.063366,-0.005307,0.119045,-0.120212,1.0,0.991483,0.122486,-0.139599,0.595525,0.54512,0.629031


In [57]:
data_hour.corr()

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
instant,1.0,0.999995,0.400986,0.860045,0.494074,-0.025045,0.018672,-0.006328,-0.003383,0.000621,0.119202,0.120837,0.032193,-0.087319,0.126526,0.192082,0.192104
dteday,0.999995,1.0,0.401437,0.859656,0.494741,-0.026456,0.018678,-0.006294,-0.003353,0.000734,0.119222,0.120885,0.03281,-0.087419,0.12603,0.191582,0.191571
season,0.400986,0.401437,1.0,-0.021385,0.826483,-0.015126,-0.019396,-0.001804,0.027724,-0.011646,0.32226,0.328319,0.160404,-0.136558,0.146942,0.163297,0.171294
yr,0.860045,0.859656,-0.021385,1.0,-0.016686,-0.025819,0.013674,-0.012511,-0.00662,-0.003383,0.014101,0.012924,-0.06188,-0.027496,0.091395,0.152968,0.151088
mnth,0.494074,0.494741,0.826483,-0.016686,1.0,-0.008425,0.01293,0.009464,0.004235,0.006734,0.210106,0.21586,0.168872,-0.125975,0.092533,0.116821,0.119992
hr,-0.025045,-0.026456,-0.015126,-0.025819,-0.008425,1.0,-0.003188,-0.006909,0.032979,-0.006686,0.10123,0.099247,-0.25426,0.123155,0.384894,0.424106,0.445536
holiday,0.018672,0.018678,-0.019396,0.013674,0.01293,-0.003188,1.0,-0.098836,-0.265976,-0.016181,-0.056875,-0.060876,-0.009792,0.019251,-0.002028,-0.055298,-0.048336
weekday,-0.006328,-0.006294,-0.001804,-0.012511,0.009464,-0.006909,-0.098836,1.0,0.048139,-0.00113,0.001846,-0.004227,-0.023487,0.002066,0.021563,0.03168,0.031825
workingday,-0.003383,-0.003353,0.027724,-0.00662,0.004235,0.032979,-0.265976,0.048139,1.0,0.026745,0.138843,0.137255,-0.040503,0.008151,-0.078118,0.206753,0.163358
weathersit,0.000621,0.000734,-0.011646,-0.003383,0.006734,-0.006686,-0.016181,-0.00113,0.026745,1.0,-0.085985,-0.090587,0.423728,0.033604,-0.146071,-0.096411,-0.11315
