# Data Analysis Project: Bike Sharing Dataset
- **Name:** Nandisya Faiz Effendi
- **Email:** faizeffendi2004@gmail.com
- **Dicoding ID:** faiz-effendi

# Defining the Business Questions

- How do overall bike rentals vary between seasons?
- What impact do holidays have on overall bike rentals compared with working days?

# Import All Packages/Libraries Used

In [1]:
# data process
import pandas as pd
import numpy as np

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# dashboard
import streamlit as st

# Data Wrangling

## Gathering Data

In [2]:
day_df = pd.read_csv('./day.csv')
hour_df = pd.read_csv('./hour.csv')  # not used in this analysis

In [3]:
day_df.sample(5)

Unnamed: 0,instant,dteday,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
395,396,2012-01-31,1,1,1,0,2,1,1,0.39,0.381317,0.416667,0.261817,324,4185,4509
647,648,2012-10-09,4,1,10,0,2,1,2,0.446667,0.438112,0.761667,0.1903,601,5791,6392
588,589,2012-08-11,3,1,8,0,6,0,2,0.6925,0.638267,0.732917,0.206479,2247,4052,6299
562,563,2012-07-16,3,1,7,0,1,1,1,0.763333,0.724125,0.645,0.164187,1088,5742,6830
37,38,2011-02-07,1,0,2,0,1,1,1,0.271667,0.303658,0.738333,0.045408,120,1592,1712


In [4]:
hour_df.sample(5)

Unnamed: 0,instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
2496,2497,2011-04-18,2,0,4,22,0,1,1,1,0.52,0.5,0.55,0.2836,9,81,90
353,354,2011-01-16,1,0,1,5,0,0,0,2,0.26,0.2576,0.56,0.1642,1,1,2
6487,6488,2011-10-02,4,0,10,21,0,0,0,1,0.36,0.3485,0.71,0.2239,17,71,88
6181,6182,2011-09-20,3,0,9,3,0,2,1,2,0.54,0.5152,0.83,0.2239,0,3,3
1759,1760,2011-03-19,1,0,3,0,0,6,0,2,0.6,0.6212,0.53,0.2537,26,50,76


**Insight:**
- The only difference between the two files is that hour.csv has an additional `hr` column/feature representing the hour of the day (0-23).

## Assessing Data

### Checking for missing values in the DataFrame

In [5]:
day_df.isna().sum()

instant       0
dteday        0
season        0
yr            0
mnth          0
holiday       0
weekday       0
workingday    0
weathersit    0
temp          0
atemp         0
hum           0
windspeed     0
casual        0
registered    0
cnt           0
dtype: int64

### Counting duplicate rows in the DataFrame

In [6]:
print(f'Duplicate rows: {day_df.duplicated().sum()}')

Duplicate rows: 0


### Checking for invalid or out-of-range data

In [7]:
day_df.describe()

Unnamed: 0,instant,season,yr,mnth,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
count,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0,731.0
mean,366.0,2.49658,0.500684,6.519836,0.028728,2.997264,0.683995,1.395349,0.495385,0.474354,0.627894,0.190486,848.176471,3656.172367,4504.348837
std,211.165812,1.110807,0.500342,3.451913,0.167155,2.004787,0.465233,0.544894,0.183051,0.162961,0.142429,0.077498,686.622488,1560.256377,1937.211452
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.05913,0.07907,0.0,0.022392,2.0,20.0,22.0
25%,183.5,2.0,0.0,4.0,0.0,1.0,0.0,1.0,0.337083,0.337842,0.52,0.13495,315.5,2497.0,3152.0
50%,366.0,3.0,1.0,7.0,0.0,3.0,1.0,1.0,0.498333,0.486733,0.626667,0.180975,713.0,3662.0,4548.0
75%,548.5,3.0,1.0,10.0,0.0,5.0,1.0,2.0,0.655417,0.608602,0.730209,0.233214,1096.0,4776.5,5956.0
max,731.0,4.0,1.0,12.0,1.0,6.0,1.0,3.0,0.861667,0.840896,0.9725,0.507463,3410.0,6946.0,8714.0

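The summary statistics above are inspected by eye. As a minimal sketch, the same check can be made explicit with two rules. Both rules are assumptions about the dataset rather than something the notebook states: that the normalized columns lie in [0, 1], and that `cnt` equals `casual + registered`. The frame `toy_df` and the helper `find_invalid_rows` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy rows for illustration only; the column names mirror day.csv.
toy_df = pd.DataFrame({
    'temp': [0.39, 0.45, 1.20],        # third row is deliberately out of range
    'hum': [0.42, 0.76, 0.65],
    'casual': [324, 601, 100],
    'registered': [4185, 5791, 900],
    'cnt': [4509, 6392, 1500],         # third row: 100 + 900 != 1500
})


def find_invalid_rows(df, unit_columns):
    """Return rows that break either assumed rule (hypothetical helper)."""
    # Rule 1 (assumption): normalized columns must lie in [0, 1].
    in_range = df[unit_columns].apply(lambda col: col.between(0, 1)).all(axis=1)
    # Rule 2 (assumption): the total equals casual plus registered.
    bad_total = df['casual'] + df['registered'] != df['cnt']
    return df[~in_range | bad_total]


invalid_rows = find_invalid_rows(toy_df, ['temp', 'hum'])
print(invalid_rows.index.tolist())
```

On the real data, calling `find_invalid_rows(day_df, ['temp', 'atemp', 'hum', 'windspeed'])` before the columns are renamed would return an empty frame if both rules hold.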

### Identifying Outliers

In [8]:
q25, q75 = np.percentile(day_df['casual'], 25), np.percentile(day_df['casual'], 75)
iqr = q75-q25
cut_off = iqr * 1.5
minimum, maximum = q25 - cut_off, q75 + cut_off

outliers_casual_users = [x for x in day_df['casual'] if x < minimum or x > maximum]

In [9]:
q25, q75 = np.percentile(day_df['registered'], 25), np.percentile(day_df['registered'], 75)
iqr = q75-q25
cut_off = iqr * 1.5
minimum, maximum = q25 - cut_off, q75 + cut_off

outliers_registered_users = [x for x in day_df['registered'] if x < minimum or x > maximum]

In [10]:
if outliers_casual_users:
    print(f'Number of casual-user outliers: {len(outliers_casual_users)}')
else:
    print('No outliers found')

Number of casual-user outliers: 44


In [11]:
if outliers_registered_users:
    print(f'Number of registered-user outliers: {len(outliers_registered_users)}')
else:
    print('No outliers found')

No outliers found

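The two outlier cells above repeat the same IQR computation and reuse the names `minimum` and `maximum`, so the second cell overwrites the bounds of the first. A minimal sketch of one way to avoid that is to wrap the rule in a function. The names `iqr_bounds`, `find_outliers`, and `toy_casual` are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def iqr_bounds(values, factor=1.5):
    """Return (lower, upper) outlier bounds from the IQR rule (hypothetical helper)."""
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = (q75 - q25) * factor
    return q25 - cut_off, q75 + cut_off


def find_outliers(values, factor=1.5):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values, factor)
    return [x for x in values if x < lower or x > upper]


# Toy values for illustration only, not taken from the full dataset.
toy_casual = pd.Series([120, 324, 601, 713, 1088, 2247, 9000])
lower, upper = iqr_bounds(toy_casual)
print(lower, upper)
print(find_outliers(toy_casual))
```

With this shape, `iqr_bounds(day_df['casual'])` and `iqr_bounds(day_df['registered'])` each return their own pair of bounds, so nothing is shared between columns.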

**Insight:**
- The missing-value and duplicate checks show no anomalies in the data.
- The descriptive statistics reveal no invalid values.
- The `casual` column/feature contains 44 outliers under the 1.5 × IQR rule, while `registered` contains none.

## Cleaning Data

### Handling outliers by capping them at the minimum or maximum bound

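In [18]:
# Recompute the IQR bounds for 'casual': In [9] overwrote minimum/maximum with the 'registered' bounds
q25, q75 = np.percentile(day_df['casual'], 25), np.percentile(day_df['casual'], 75)
cut_off = (q75 - q25) * 1.5
minimum, maximum = q25 - cut_off, q75 + cut_off

kondisi_lower_than = day_df['casual'] < minimum
kondisi_more_than = day_df['casual'] > maximum

# Pass cond and other positionally, and assign the result back instead of using inplace on a column
day_df['casual'] = day_df['casual'].mask(kondisi_more_than, maximum)
day_df['casual'] = day_df['casual'].mask(kondisi_lower_than, minimum)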

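Capping at a lower and an upper bound can also be written with `Series.clip`, which applies both limits in one call. The sketch below is an alternative to the `mask` approach, not the notebook's own method, and it runs on a hypothetical `toy_df` so that it is self-contained.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only.
toy_df = pd.DataFrame({'casual': [120, 324, 601, 713, 1088, 2247, 9000]})

# IQR bounds, computed the same way as in the outlier-identification step.
q25, q75 = np.percentile(toy_df['casual'], [25, 75])
cut_off = (q75 - q25) * 1.5
lower, upper = q25 - cut_off, q75 + cut_off

# clip() replaces values below `lower` with `lower` and values above `upper` with `upper`.
toy_df['casual_capped'] = toy_df['casual'].clip(lower=lower, upper=upper)
print(toy_df)
```

Because the bounds are floats, the capped column becomes `float64` even though the input was integer.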
### Renaming columns to better represent their contents

In [13]:
day_df.rename(columns={
    'dteday': 'date',
    'yr': 'year',
    'mnth': 'month',
    'cnt': 'total_book',
}, inplace=True)

In [14]:
day_df['date'] = pd.to_datetime(day_df['date'])

In [15]:
day_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 731 entries, 0 to 730
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   instant     731 non-null    int64         
 1   date        731 non-null    datetime64[ns]
 2   season      731 non-null    int64         
 3   year        731 non-null    int64         
 4   month       731 non-null    int64         
 5   holiday     731 non-null    int64         
 6   weekday     731 non-null    int64         
 7   workingday  731 non-null    int64         
 8   weathersit  731 non-null    int64         
 9   temp        731 non-null    float64       
 10  atemp       731 non-null    float64       
 11  hum         731 non-null    float64       
 12  windspeed   731 non-null    float64       
 13  casual      731 non-null    int64         
 14  registered  731 non-null    int64         
 15  total_book  731 non-null    int64         
dtypes: datetime64[ns](1), float64(4), int64(11)

In [16]:
# Map the numeric season codes (1-4) to readable labels
day_df['season'] = day_df['season'].map({1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'})

In [17]:
day_df.sample(5)

Unnamed: 0,instant,date,season,year,month,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,total_book
107,108,2011-04-18,summer,0,4,0,1,1,1,0.5125,0.503146,0.5425,0.163567,669,2760,3429
112,113,2011-04-23,summer,0,4,0,6,0,2,0.46,0.450121,0.887917,0.230725,1462,2574,4036
556,557,2012-07-10,fall,1,7,0,2,1,2,0.720833,0.664796,0.6675,0.151737,954,5336,6290
270,271,2011-09-28,winter,0,9,0,3,1,2,0.635,0.575158,0.84875,0.148629,480,3427,3907
278,279,2011-10-06,winter,0,10,0,4,1,1,0.494167,0.480425,0.620833,0.134954,639,4126,4765


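**Insight:**
- The columns `dteday`, `yr`, `mnth`, and `cnt` were renamed to `date`, `year`, `month`, and `total_book`, and `date` was converted to `datetime64[ns]`.
- The numeric `season` codes (1-4) were replaced with the labels spring, summer, fall, and winter.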

# Exploratory Data Analysis (EDA)

## Explore

**Insight:**
- insight1
- insight2

# Visualization & Explanatory Analysis

## Question 1:
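The first business question asks how overall rentals vary between seasons. One possible starting point, sketched here on toy data, is to aggregate `total_book` per `season`. The frame `toy_df` and its numbers are made up for illustration and say nothing about the real results.

```python
import pandas as pd

# Toy rows for illustration only; the real analysis would use day_df.
toy_df = pd.DataFrame({
    'season': ['spring', 'spring', 'summer', 'summer',
               'fall', 'fall', 'winter', 'winter'],
    'total_book': [1000, 2000, 4000, 5000, 6000, 7000, 3000, 5000],
})

# Keep the seasons in calendar order rather than alphabetical order.
season_order = ['spring', 'summer', 'fall', 'winter']

season_summary = (
    toy_df.groupby('season')['total_book']
    .agg(['mean', 'sum'])
    .reindex(season_order)
)
print(season_summary)
```

The resulting table could then be plotted, for example with `sns.barplot(x=season_summary.index, y=season_summary['mean'])`.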

## Question 2:
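The second business question compares holidays with working days. Days that are neither (weekends) would blur that comparison if they were grouped with either side, so this sketch labels each day before aggregating. The `day_type` column and the toy numbers are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; the real analysis would use day_df.
toy_df = pd.DataFrame({
    'holiday':    [1, 0, 0, 0, 0, 1],
    'workingday': [0, 1, 1, 0, 1, 0],
    'total_book': [3000, 4500, 5500, 4000, 5000, 2000],
})

# Label each day: holiday first, then working day, otherwise weekend.
toy_df['day_type'] = np.select(
    [toy_df['holiday'] == 1, toy_df['workingday'] == 1],
    ['holiday', 'working day'],
    default='weekend',
)

day_type_mean = toy_df.groupby('day_type')['total_book'].mean()
print(day_type_mean)
```

Comparing means rather than sums matters here, because holidays are far rarer than working days in the dataset (the mean of `holiday` in the descriptive statistics is about 0.029).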

**Insight:**
- insight1
- insight2

# Further Analysis (RFM or other | Optional)

# Conclusion

- conc 1
- conc 2