# **Bike Sharing Analysis**

Dataset: Bike Sharing Dataset ([Source](https://www.kaggle.com/datasets/lakshmi25npathi/bike-sharing-dataset))

# **1. Library Import**

*Library* [`pandas`](https://pandas.pydata.org) for data processing, analysis, and manipulation.

*Library* [`matplotlib`](https://matplotlib.org/) for visualization.

*Library* [`seaborn`](https://seaborn.pydata.org/) for drawing attractive and informative statistical graphics.

*Library* [`numpy`](https://numpy.org/) for numerical computation (used here for percentiles).

*Library* [`zipfile`](https://docs.python.org/3/library/zipfile.html) (class `ZipFile`) for extracting the zip archive.

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from zipfile import ZipFile
import numpy as np

# **2. Data Wrangling**

## *2.1 Gathering Data*

In [5]:
with ZipFile("Bike-sharing-dataset.zip", 'r') as zipObj:
    zipObj.extractall()

In [6]:
data_day = pd.read_csv('day.csv')
data_hour = pd.read_csv('hour.csv')
data_day, data_hour

(     instant      dteday  season  yr  mnth  holiday  weekday  workingday  \
 0          1  2011-01-01       1   0     1        0        6           0   
 1          2  2011-01-02       1   0     1        0        0           0   
 2          3  2011-01-03       1   0     1        0        1           1   
 3          4  2011-01-04       1   0     1        0        2           1   
 4          5  2011-01-05       1   0     1        0        3           1   
 ..       ...         ...     ...  ..   ...      ...      ...         ...   
 726      727  2012-12-27       1   1    12        0        4           1   
 727      728  2012-12-28       1   1    12        0        5           1   
 728      729  2012-12-29       1   1    12        0        6           0   
 729      730  2012-12-30       1   1    12        0        0           0   
 730      731  2012-12-31       1   1    12        0        1           1   
 
      weathersit      temp     atemp       hum  windspeed  casual  registe

Extracting the zip archive yields two files, `day.csv` and `hour.csv`. `day.csv` has 731 rows and 16 columns; `hour.csv` has 17379 rows and 17 columns. Attribute information is available in the [Readme.txt](Readme.txt) file.
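As a side note, the CSV files can also be read straight from the archive without extracting them to disk. The following is a minimal sketch, not the approach used in this notebook: the helper name `read_csv_from_zip` is hypothetical, and a tiny in-memory archive with toy rows stands in for `Bike-sharing-dataset.zip`.

```python
import io
from zipfile import ZipFile

import pandas as pd


def read_csv_from_zip(archive, member):
    """Read one CSV member of a zip archive into a DataFrame without extracting it.

    `archive` may be a path or a file-like object; this helper is hypothetical.
    """
    with ZipFile(archive, "r") as zip_obj:
        with zip_obj.open(member) as handle:
            return pd.read_csv(handle)


# Tiny in-memory archive standing in for the real zip file (toy rows only).
buffer = io.BytesIO()
with ZipFile(buffer, "w") as zip_obj:
    zip_obj.writestr("day.csv", "instant,dteday,cnt\n1,2011-01-01,10\n2,2011-01-02,20\n")
buffer.seek(0)

demo_day = read_csv_from_zip(buffer, "day.csv")
print(demo_day.shape)  # (rows, columns) of the loaded frame
```

With the real archive, the same call would presumably be `read_csv_from_zip("Bike-sharing-dataset.zip", "day.csv")`, and `.shape` gives a quick check of the row and column counts.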

## *2.2 Assessing Data*

In [9]:
data_day.isnull().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [8]:
data_hour.isnull().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
hr            0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

In [12]:
print(data_day.duplicated().sum())

0


In [13]:
print(data_hour.duplicated().sum())

0


In [22]:
data_hour.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB
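The `info()` summary lists `dteday` with dtype `object`, meaning the dates are stored as plain strings. If date-based operations are needed later, one option is to parse the column with `pd.to_datetime`. The sketch below uses a toy frame (`demo_hour`, illustrative values only) rather than the real data, and this conversion is a suggestion, not a step taken in the notebook.

```python
import pandas as pd

# Toy frame mimicking a few columns of hour.csv (values are illustrative).
demo_hour = pd.DataFrame({
    "dteday": ["2011-01-01", "2011-01-01", "2011-01-02"],
    "hr": [0, 1, 0],
    "cnt": [5, 7, 3],
})

# Parse the ISO-formatted date strings into datetime64 values.
demo_hour["dteday"] = pd.to_datetime(demo_hour["dteday"], format="%Y-%m-%d")

print(demo_hour.dtypes)
# The .dt accessor becomes available once the column is datetime-typed.
print(demo_hour["dteday"].dt.day_name().tolist())
```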

In [27]:
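def outlier_check(data):
    """Count values outside the 1.5 * IQR fences in every numeric column."""
    outlier = {}
    for column in data.select_dtypes(include=np.number).columns:
        # First and third quartiles of the column
        q25, q75 = np.percentile(data[column], 25), np.percentile(data[column], 75)
        iqr = q75 - q25
        # Values beyond these fences are treated as outliers
        lower_outlier = q25 - 1.5 * iqr
        high_outlier = q75 + 1.5 * iqr
        outliers = data[(data[column] < lower_outlier) | (data[column] > high_outlier)][column]
        outlier[column] = len(outliers)
    # Sum of per-column counts: a row flagged in two columns is counted twice
    total = sum(outlier.values())
    return outlier, total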

day_outlier, total_day = outlier_check(data_day)
hour_outlier, total_hour = outlier_check(data_hour)
print(f"Total outlier values in day data: {total_day}")
print(f"Total outlier values in hour data: {total_hour}")
day_outlier, hour_outlier

Total outlier values in day data: 80
Total outlier values in hour data: 3244


({'instant': 0,
  'season': 0,
  'yr': 0,
  'mnth': 0,
  'holiday': 21,
  'weekday': 0,
  'workingday': 0,
  'weathersit': 0,
  'temp': 0,
  'atemp': 0,
  'hum': 2,
  'windspeed': 13,
  'casual': 44,
  'registered': 0,
  'cnt': 0},
 {'instant': 0,
  'season': 0,
  'yr': 0,
  'mnth': 0,
  'hr': 0,
  'holiday': 500,
  'weekday': 0,
  'workingday': 0,
  'weathersit': 3,
  'temp': 0,
  'atemp': 0,
  'hum': 22,
  'windspeed': 342,
  'casual': 1192,
  'registered': 680,
  'cnt': 505})

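1. Neither `day.csv` nor `hour.csv` contains null values.
2. Neither `day.csv` nor `hour.csv` contains duplicate rows.
3. Under the 1.5 × IQR rule, `day.csv` has 80 outlier values and `hour.csv` has 3244, summed across all numeric columns (a row flagged in several columns is counted once per column).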
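The outlier totals in the list above count flagged values, not rows. The sketch below illustrates the difference on toy data (not taken from the dataset) using a vectorized version of the same 1.5 × IQR rule; the helper name `iqr_outlier_mask` is hypothetical. It also shows why a mostly-zero flag column such as `holiday` gets flagged: both quartiles are 0, so the IQR is 0 and every nonzero value falls outside the fences.

```python
import pandas as pd


def iqr_outlier_mask(data):
    """Boolean frame: True where a numeric value lies outside the 1.5 * IQR fences."""
    numeric = data.select_dtypes(include="number")
    q25 = numeric.quantile(0.25)
    q75 = numeric.quantile(0.75)
    iqr = q75 - q25
    return (numeric < q25 - 1.5 * iqr) | (numeric > q75 + 1.5 * iqr)


# Toy data: a mostly-zero flag column and a skewed count column.
demo = pd.DataFrame({
    "holiday": [0, 0, 0, 0, 0, 0, 0, 1],
    "casual": [10, 12, 11, 13, 12, 11, 10, 90],
})

mask = iqr_outlier_mask(demo)
print(mask.sum())                    # flagged values per column
print(int(mask.sum().sum()))         # flagged values in total
print(int(mask.any(axis=1).sum()))   # rows with at least one flagged value
```

In this toy frame the last row is flagged in both columns, so it contributes two flagged values but only one row.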

## *2.3 Cleaning Data*

In [29]:
def remove_outliers(data):
    """Drop every row that has an outlier (1.5 * IQR rule) in any numeric column."""
    row_remove = set()  # index labels of rows to drop, collected across columns
    for column in data.select_dtypes(include=np.number).columns:
        q25, q75 = np.percentile(data[column], 25), np.percentile(data[column], 75)
        iqr = q75 - q25
        lower_outlier = q25 - 1.5 * iqr
        high_outlier = q75 + 1.5 * iqr
        outlier = data[(data[column] < lower_outlier) | (data[column] > high_outlier)].index
        row_remove.update(outlier)
    # A row flagged in several columns is dropped only once
    cleaned_data = data.drop(index=list(row_remove))
    return cleaned_data

def print_outlier_summary(df, cleaned_df, dataset_name):
    rows_before = len(df)
    rows_after = len(cleaned_df)
    rows_removed = rows_before - rows_after
    print(f"Dataset: {dataset_name}")
    print(f"Rows before cleaning: {rows_before}")
    print(f"Rows removed (containing outliers): {rows_removed}")
    print(f"Rows after cleaning: {rows_after}")
    print("-" * 50)

data_day_cleaned = remove_outliers(data_day)
data_hour_cleaned = remove_outliers(data_hour)
print_outlier_summary(data_day, data_day_cleaned, "day.csv")
print_outlier_summary(data_hour, data_hour_cleaned, "hour.csv")

Dataset: day.csv
Rows before cleaning: 731
Rows removed (containing outliers): 76
Rows after cleaning: 655
--------------------------------------------------
Dataset: hour.csv
Rows before cleaning: 17379
Rows removed (containing outliers): 2580
Rows after cleaning: 14799
--------------------------------------------------
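Because the cleaning step applies the IQR rule to every numeric column, rows flagged only in `holiday` (21 values in the day data, 500 in the hour data according to the counts in section 2.2) are removed as well. A possible variation, shown here only as a sketch and not as the method used in this notebook, is to restrict the rule to a chosen list of columns. The function name `remove_outliers_in` and the column choice are assumptions, and the data is a toy frame.

```python
import pandas as pd


def remove_outliers_in(data, columns):
    """Drop rows whose value in any of `columns` lies outside the 1.5 * IQR fences."""
    subset = data[columns]
    q25 = subset.quantile(0.25)
    q75 = subset.quantile(0.75)
    iqr = q75 - q25
    is_outlier = (subset < q25 - 1.5 * iqr) | (subset > q75 + 1.5 * iqr)
    # Keep only the rows that are not flagged in any of the selected columns.
    return data.loc[~is_outlier.any(axis=1)]


# Toy data (not from the dataset): one extreme windspeed value and one holiday row.
demo = pd.DataFrame({
    "holiday": [0, 0, 0, 0, 0, 0, 0, 1],
    "windspeed": [0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.90, 0.12],
})

cleaned = remove_outliers_in(demo, ["windspeed"])
print(len(demo), len(cleaned))    # row counts before and after
print(cleaned["holiday"].sum())   # the holiday row is kept
```

Whether flag-like columns should be excluded depends on the analysis goal, so the column list is best chosen deliberately.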
