# Data Analysis Project: Bike Sharing Dataset
- **Name:** Alfidah
- **Email:** m315d4kx2173@bangkit.academy
- **Dicoding ID:** alfidah

## Defining the Business Questions

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental
and return, has become automatic. Through these systems, a user can easily rent a bike at one location and return
it at another. Currently, there are over 500 bike-sharing programs around the world, comprising
over 500 thousand bicycles. Today, there is great interest in these systems due to their important role in traffic,
environmental, and health issues.

Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by
these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration
of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into
a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most of the important
events in the city could be detected by monitoring these data.

- How does daily bike usage change over time, and which season or month experiences the highest usage? Is there a correlation between bike usage and the number of holidays in that season or month?
- What are the mean values of temperature (temp), feels-like temperature (atemp), humidity (hum), and wind speed? How do these factors correlate with the number of daily bike rentals?

## Import All Packages/Libraries Used

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

: 

## Data Wrangling

- instant: record index
- dteday : date
- season : season (1:spring, 2:summer, 3:fall, 4:winter)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : if day is neither weekend nor holiday is 1, otherwise is 0.
- weathersit :
  - 1: Clear, Few clouds, Partly cloudy
  - 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
  - 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
  - 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
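
The normalized weather columns can be mapped back to physical units by reversing the divisions listed above. The following is a minimal sketch that assumes the simple "divide by the maximum" scaling described in this list; the helper name `denormalize_weather`, the `_raw` column suffix, and the sample row are made up for illustration.

```python
import pandas as pd

# Scaling constants taken from the column descriptions above
SCALE = {'temp': 41, 'atemp': 50, 'hum': 100, 'windspeed': 67}

def denormalize_weather(df):
    """Return a copy with extra *_raw columns holding the rescaled values (hypothetical helper)."""
    out = df.copy()
    for column, maximum in SCALE.items():
        out[f'{column}_raw'] = out[column] * maximum
    return out

# Toy row for illustration only, not taken from day.csv
sample = pd.DataFrame({'temp': [0.5], 'atemp': [0.5], 'hum': [0.8], 'windspeed': [0.25]})
restored = denormalize_weather(sample)
print(restored[['temp_raw', 'atemp_raw', 'hum_raw', 'windspeed_raw']])
```

The same call should work on the real dataframes, for example `denormalize_weather(day)`, once they are loaded.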

### Gathering Data

In [None]:
day = pd.read_csv('day.csv')

day

: 

In [None]:
hour = pd.read_csv('hour.csv')

hour

: 

The hour dataframe provides detailed information for each hour, in contrast to the day dataframe, which offers daily information. Therefore, I believe merging these two dataframes is unnecessary.

### Assessing Data

In [None]:
day.info()

: 

In [None]:
hour.info()

: 

It appears that all the columns are numeric except for 'dteday'. We will convert this column to the DateTime type to facilitate time series visualization later in the cleaning process. Now, let's check if the data has any missing values.

In [None]:
day.isna().sum()

: 

In [None]:
hour.isna().sum()

: 

After confirming that there are no missing values in the data, we now check whether the dataset contains any duplicated rows.

In [None]:
day.duplicated().sum()

: 

In [None]:
hour.duplicated().sum()

: 

There are also no duplicated values in the data. Now, we will check for any inconsistencies within the data. I will first verify in the day dataframe whether the 'cnt' column equals the sum of the 'registered' and 'casual' columns.

In [None]:
is_equal = (day['casual'] + day['registered'] == day['cnt']).all()

print("Is the sum of 'casual' and 'registered' equal to 'cnt' for all rows?:", is_equal)

: 

The day dataframe also has 731 rows, which matches the total number of days in 2011 and 2012 combined (365 + 366, since 2012 is a leap year).
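
The row count alone does not prove that every calendar day is present exactly once. As a hedged sketch, the expected calendar can be built with `pd.date_range` and compared against the dates in the data; the helper `missing_dates` is a hypothetical name, and the single-date input is only a toy example.

```python
import pandas as pd

# Every calendar day in 2011 and 2012 (2012 is a leap year: 365 + 366 = 731)
expected_days = pd.date_range('2011-01-01', '2012-12-31', freq='D')
print(len(expected_days))

def missing_dates(dates):
    """Return the calendar days that do not appear in `dates` (hypothetical helper)."""
    return expected_days.difference(pd.to_datetime(dates))

# Toy input with a single date, for illustration only
print(len(missing_dates(['2011-01-01'])))
```

On the real data, `missing_dates(day['dteday'])` should come back empty if no day is missing.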

Next, I will examine the hour dataframe by grouping it by day to see if the results match those of the day dataframe.

In [None]:
grouped = hour.groupby('dteday').agg({
    'season': 'first',
    'yr': 'first',
    'mnth': 'first',
    'holiday': 'first',
    'weekday': 'first',
    'workingday': 'first',
    'weathersit': 'first',
    'temp': 'mean',
    'atemp': 'mean',
    'hum': 'mean',
    'windspeed': 'mean',
    'casual': 'sum',
    'registered': 'sum',
    'cnt': 'sum'
})

grouped = grouped.reset_index()
grouped['instant'] = range(1, len(grouped) + 1)

grouped

: 

Upon reviewing the data, I believe we can conclude that it is consistent. There might be slight inconsistencies caused by rounding issues for variables such as temperature, but these are not immediately apparent.
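
The visual comparison can also be made explicit for the count columns, where the hourly sums should reproduce the daily values exactly. This is a sketch under assumptions: `compare_daily` is a hypothetical helper, the two frames are toy data rather than the real files, and on the real data it would need to run before any outlier treatment changes the counts.

```python
import pandas as pd

def compare_daily(hour_df, day_df, sum_cols=('casual', 'registered', 'cnt')):
    """Sum hourly counts per date and report, per column, whether they match the daily table."""
    hourly_totals = hour_df.groupby('dteday')[list(sum_cols)].sum()
    daily_totals = day_df.set_index('dteday')[list(sum_cols)]
    # Align on the date index so row order does not matter; missing dates become NaN
    hourly_totals, daily_totals = hourly_totals.align(daily_totals, join='outer')
    return (hourly_totals == daily_totals).all()

# Toy frames for illustration only, not the real hour.csv / day.csv
toy_hour = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-01', '2011-01-02'],
    'casual': [1, 2, 3], 'registered': [10, 20, 30], 'cnt': [11, 22, 33],
})
toy_day = pd.DataFrame({
    'dteday': ['2011-01-01', '2011-01-02'],
    'casual': [3, 3], 'registered': [30, 30], 'cnt': [33, 33],
})
print(compare_daily(toy_hour, toy_day))
```

Averaged columns such as `temp` are better compared with a tolerance (for example `numpy.isclose`) than with strict equality, because of the rounding mentioned above.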

The last step in our data wrangling process will be to check for any outliers in our data. We will use the function defined in the Dicoding course to perform this check.

In [None]:
def check_outliers(data):
  data = data.dropna()
  q25, q75 = np.percentile(data, 25), np.percentile(data, 75)
  iqr = q75 - q25
  cut_off = iqr * 1.5
  minimum, maximum = q25 - cut_off, q75 + cut_off

  outliers = [x for x in data if x < minimum or x > maximum]
  return len(outliers)

: 
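
To make the rule inside `check_outliers` concrete, here is a small worked example of the 1.5 * IQR fences on toy numbers (not values from the dataset). The variable names `lower_fence` and `upper_fence` are illustrative.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 100])  # toy data, not from the dataset

# Quartiles with NumPy's default linear interpolation
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
lower_fence = q25 - 1.5 * iqr
upper_fence = q75 + 1.5 * iqr

# Anything strictly outside the fences counts as an outlier
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(q25, q75, lower_fence, upper_fence, outliers)
```

Here the quartiles are 2.25 and 4.75, so the fences sit at -1.5 and 8.5 and only the value 100 is flagged.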

In [None]:
for column in day.drop(columns = ['dteday']).columns:
    outliers_count = check_outliers(day[column])
    print(f"The column {column} has {outliers_count} outlier values.")

: 

In [None]:
for column in hour.drop(columns = ['dteday']).columns:
    outliers_count = check_outliers(hour[column])
    print(f"The column {column} has {outliers_count} outlier values.")

: 

### Cleaning Data

For the data cleaning process, we will begin by converting the 'dteday' column to the DateTime type.

In [None]:
day['dteday'] = pd.to_datetime(day['dteday'])
hour['dteday'] = pd.to_datetime(hour['dteday'])

: 

Since there are no missing or duplicated values, we will focus on handling outliers using the imputation method to avoid losing significant information. We will also utilize the function defined in the Dicoding course for this purpose.

In [None]:
def impute_outliers(column):
    # Quartiles and interquartile range of the column
    Q1 = column.quantile(0.25)
    Q3 = column.quantile(0.75)
    IQR = Q3 - Q1

    # Fences of the 1.5 * IQR rule
    maximum = Q3 + (1.5 * IQR)
    minimum = Q1 - (1.5 * IQR)

    # Cap values beyond a fence at that fence and leave everything else unchanged
    column = column.apply(lambda x: maximum if x > maximum else (minimum if x < minimum else x))
    return column

: 
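
For reference, the same capping can be written with `Series.clip`. This is an illustrative alternative, not the course's function; `cap_at_fences` is a hypothetical name, and the input is toy data stored as floats.

```python
import pandas as pd

def cap_at_fences(column):
    """Cap values outside the 1.5 * IQR fences using Series.clip (illustrative alternative)."""
    q1, q3 = column.quantile(0.25), column.quantile(0.75)
    iqr = q3 - q1
    return column.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])  # toy data, not from the dataset
capped = cap_at_fences(toy)
print(capped.tolist())
```

The extreme value 100.0 is pulled back to the upper fence (8.5) while the number of rows stays the same, which is the point of capping instead of dropping.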

Let's first list the columns that contain outliers before applying the function. We will exclude 'holiday' from imputation: holidays are so infrequent that the value 1 will always be flagged as an outlier, even though it is a valid value.

In [None]:
day_outliers = ['hum', 'windspeed', 'casual']
hour_outliers = ['weathersit', 'hum', 'windspeed', 'casual', 'registered', 'cnt']

: 
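
The reason for leaving out 'holiday' can be seen on a toy binary flag (the 97/3 split is made up for illustration). When at least three quarters of the values are 0, both quartiles are 0, the IQR collapses to 0, and every 1 falls outside the fences.

```python
import numpy as np

# Toy binary flag: 97 ordinary days and 3 holidays (illustrative numbers only)
flag = np.array([0] * 97 + [1] * 3)

q25, q75 = np.percentile(flag, [25, 75])
iqr = q75 - q25  # both quartiles are 0, so the IQR is 0
lower_fence, upper_fence = q25 - 1.5 * iqr, q75 + 1.5 * iqr

# Every 1 lies above the upper fence of 0 and is therefore flagged
flagged = int(((flag < lower_fence) | (flag > upper_fence)).sum())
print(iqr, upper_fence, flagged)
```

Capping such a column would turn every holiday into a non-holiday, so excluding it keeps the information intact.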

In [None]:
for column_name in day_outliers:
    day[column_name] = impute_outliers(day[column_name])

for column_name in hour_outliers:
    hour[column_name] = impute_outliers(hour[column_name])

: 

Capping outliers at the minimum and maximum fence values keeps every row while pulling extreme values back to the boundary. This may not eliminate every flagged value, but it should reduce their number significantly. Let's check again.

In [None]:
for column in day.drop(columns = ['dteday']).columns:
    outliers_count = check_outliers(day[column])
    print(f"The column {column} has {outliers_count} outlier values.")

: 

In [None]:
for column in hour.drop(columns = ['dteday']).columns:
    outliers_count = check_outliers(hour[column])
    print(f"The column {column} has {outliers_count} outlier values.")

: 

Great, now the dataframe is clean, and we can proceed with the Exploratory Data Analysis (EDA) process.

## Exploratory Data Analysis (EDA)

### Explore Data Summary Statistics

We will explore an overview of the data's summary statistics using the `describe()` method.

In [None]:
day.describe()

: 

In [None]:
hour.describe()

: 

### Explore Numeric Column Distribution (Continuous)

For continuous data, we will examine the distribution using histogram visualizations in seaborn. For this reason, we will exclude other numeric columns that are discrete.

In [None]:
numeric_columns = day.drop(columns = ['instant', 'season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit']).select_dtypes(include=['number']).columns
n_cols = 4
n_rows = (len(numeric_columns) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
fig.suptitle('Distribution of Day Numeric Columns (Continuous)', fontsize=16)

for i, column in enumerate(numeric_columns):
    row, col = divmod(i, n_cols)
    sns.histplot(day[column], kde=True, ax=axes[row, col])
    axes[row, col].set_title(f'Distribution of {column}')
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

: 

In [None]:
numeric_columns = hour.drop(columns = ['instant', 'season', 'yr', 'mnth', 'hr','holiday', 'weekday', 'workingday', 'weathersit']).select_dtypes(include=['number']).columns
n_cols = 4
n_rows = (len(numeric_columns) + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
fig.suptitle('Distribution of Hour Numeric Columns (Continuous)', fontsize=16)

for i, column in enumerate(numeric_columns):
    row, col = divmod(i, n_cols)
    sns.histplot(hour[column], kde=True, ax=axes[row, col])
    axes[row, col].set_title(f'Distribution of {column}')
    axes[row, col].set_xlabel(column)
    axes[row, col].set_ylabel('Frequency')

plt.tight_layout()
plt.show()

: 

Some columns look roughly normally distributed, while others are noticeably skewed. An interesting pattern emerges in the hourly visualizations: casual, registered, and cnt show spikes at their maximum values. This is likely a side effect of the earlier outlier imputation, which moved every extreme value onto the upper boundary.
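
Skewness can also be quantified rather than judged by eye. A minimal sketch using pandas' `Series.skew()` on toy values (not taken from the dataset): a value near 0 indicates a symmetric distribution, and a positive value indicates a long right tail.

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])          # toy values, evenly spread
right_skewed = pd.Series([1, 1, 2, 2, 3, 10])   # toy values with a long right tail

print(symmetric.skew(), right_skewed.skew())
```

On the real data, something like `day[['casual', 'registered', 'cnt']].skew()` would give one number per column to compare against the histograms.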

### Explore Numeric Column Distribution (Discrete)

For the discrete numeric columns, we will examine the distribution of `holiday`, `workingday`, and `weathersit` from the 'day' dataframe only. Variables such as `season`, `yr`, `mnth`, and `weekday` are left out because the data has one row per day, so their counts simply follow the calendar.
The 'hour' dataframe is not assessed here: `holiday` and `workingday` are constant within a day, so hourly rows would only multiply the counts without changing their distribution. Note that `weathersit` is recorded per hour in the 'hour' dataframe and can vary within a day, so the counts here reflect the daily values only.

In [None]:
numeric_columns = ['holiday', 'workingday', 'weathersit']

n_cols = 3
n_rows = (len(numeric_columns) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(15, n_rows * 4))
fig.suptitle('Count Distribution of Day Numeric Columns (Discrete)', fontsize=16)

axes = axes.flatten()

for i, column in enumerate(numeric_columns):
    sns.countplot(x=column, data=day, ax=axes[i])
    axes[i].set_title(f'Count of {column}')
    axes[i].set_xlabel(column)
    axes[i].set_ylabel('Count')

plt.tight_layout()
plt.show()

: 

### Explore Numeric Column Correlation

In [None]:
# Pairwise correlation of all numeric columns except the record index
corr_day = day.drop(columns=['instant']).select_dtypes(include=['number']).corr()
corr_hour = hour.drop(columns=['instant']).select_dtypes(include=['number']).corr()

mask_day = np.triu(np.ones_like(corr_day, dtype=bool))
mask_hour = np.triu(np.ones_like(corr_hour, dtype=bool))
fig, ax = plt.subplots(1, 2, figsize=(20, 8))

sns.heatmap(corr_day, mask=mask_day, annot=True, fmt=".2f", cmap='coolwarm',
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax[0])
ax[0].set_title('Day Correlation Matrix Heatmap')

sns.heatmap(corr_hour, mask=mask_hour, annot=True, fmt=".2f", cmap='coolwarm',
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax[1])
ax[1].set_title('Hour Correlation Matrix Heatmap')

plt.tight_layout()
plt.show()

: 

Strictly speaking, categorical variables call for a different correlation method, but the course materials did not cover this. We will therefore assume that all columns can be analyzed with the default (Pearson) method of `corr()`.
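
One commonly used option for ordinal codes is a rank-based (Spearman) correlation, which is the Pearson correlation computed on ranks. The sketch below is not the method used in this notebook; it only illustrates the difference on toy data where `y` grows monotonically but not linearly with `x`.

```python
import pandas as pd

# Toy data: y grows monotonically but not linearly with x (illustrative only)
toy = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [1, 8, 27, 64, 125]})

pearson = toy.corr().loc['x', 'y']           # default method: Pearson
spearman = toy.rank().corr().loc['x', 'y']   # Pearson on ranks is Spearman

print(round(pearson, 3), round(spearman, 3))
```

The rank-based value is exactly 1 for any strictly increasing relationship, while Pearson stays below 1 when the relationship is curved. Neither method is suitable for purely nominal or cyclic codes.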

We observe that temperature, season, year, and hour play roles in affecting bike usage, as indicated by the correlation with the 'cnt' column.

### Explore Monthly Bike Usage Distribution and Clustering

In this exploration, we will group the data by month and then examine the total bike usage from the 'cnt' column.

In [None]:
monthly_rentals = day.groupby(by = 'mnth').agg({
  'cnt': 'sum',
}).reset_index()

: 

In [None]:
plt.figure(figsize=(10, 6))

sns.barplot(x='mnth', y='cnt', data=monthly_rentals)
plt.title('Explore Monthly Bike Usage Distribution')
plt.xlabel('Month')
plt.ylabel('Total Bike Rentals')
plt.show()

: 

This chart is already informative, but the months are hard to compare at a glance. We can go further by binning the monthly totals with threshold values: more than 300,000 rentals in a month is classified as 'High', more than 200,000 as 'Medium', and anything else as 'Low'.

In [None]:
def classify_usage(cnt):
    if cnt > 300000:
        return 'High'
    elif cnt > 200000:
        return 'Medium'
    else:
        return 'Low'

monthly_rentals['category'] = monthly_rentals['cnt'].apply(classify_usage)

monthly_rentals

: 
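
The same binning can be expressed with `pd.cut`, which avoids a hand-written function. This is an equivalent sketch rather than the approach used above; the totals are toy numbers chosen to sit on and around the thresholds.

```python
import numpy as np
import pandas as pd

# Same thresholds as classify_usage: >300,000 High, >200,000 Medium, otherwise Low
bins = [-np.inf, 200_000, 300_000, np.inf]
labels = ['Low', 'Medium', 'High']

totals = pd.Series([150_000, 200_000, 250_000, 300_000, 350_000])  # toy totals, illustrative only

# Bins are closed on the right, so exactly 200,000 is Low and exactly 300,000 is Medium
categories = pd.cut(totals, bins=bins, labels=labels)
print(categories.tolist())
```

Because the bins are right-closed, the boundary values land in the same category as they do with the strict `>` comparisons in `classify_usage`.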

In [None]:
plt.figure(figsize=(10, 6))

sns.barplot(x='mnth', y='cnt', hue='category', data=monthly_rentals, palette='deep')
plt.title('Explore Monthly Bike Usage Distribution')
plt.xlabel('Month')
plt.ylabel('Total Bike Rentals')
plt.legend(title='Category')

plt.show()

: 

The categories make the pattern easier to read. Mid-year has the highest bike usage, the months just before mid-year and towards the end of the year show medium usage, and the early part of the year has the lowest.

## Visualization & Explanatory Analysis

### Question 1: How does daily bike usage change over time, and which season or month experiences the highest usage? Is there a correlation between bike usage and the number of holidays in that season or month?

In [None]:
plt.figure(figsize=(15, 7))
plt.plot(day['dteday'], day['cnt'], label='Total', color='blue')
plt.plot(day['dteday'], day['registered'], label='Registered', color='green')
plt.plot(day['dteday'], day['casual'], label='Casual', color='red')

plt.title('Daily Bike Usage Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Bike Rentals')
plt.legend()
plt.tight_layout()
plt.show()

: 

The daily line plot reveals that **bike usage peaked in the mid to late year**. This is consistent with the earlier exploratory analysis, where mid-year exhibited the highest usage. Registered and casual users follow similar patterns over time, although casual users have lower counts.
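
Daily counts are noisy, so a rolling mean can make the seasonal peak easier to locate. The sketch below uses a synthetic series (a yearly wave plus random noise, not the real data) and a 30-day window; both the window length and the series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a yearly wave plus noise (not the real data)
rng = np.random.default_rng(0)
dates = pd.date_range('2011-01-01', periods=365, freq='D')
cnt = 4000 + 2000 * np.sin(2 * np.pi * (np.arange(365) - 100) / 365) + rng.normal(0, 500, 365)
series = pd.Series(cnt, index=dates)

# 30-day centered rolling mean smooths day-to-day noise so the seasonal peak stands out
smooth = series.rolling(window=30, center=True, min_periods=1).mean()
print(smooth.idxmax())
```

On the real data, the equivalent would be `day.set_index('dteday')['cnt'].rolling(30, center=True, min_periods=1).mean()`, plotted on top of the daily line.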

In [None]:
monthly_rentals = day.groupby(by = 'mnth').agg({
  'cnt': 'sum',
  'holiday': 'sum'
}).reset_index()

monthly_rentals

: 

In [None]:
plt.figure(figsize=(10, 6))

sns.barplot(x='mnth', y='cnt', data=monthly_rentals, palette='pastel', hue = 'mnth', legend = False)
plt.title('Explore Monthly Bike Usage Distribution')
plt.xlabel('Month')
plt.ylabel('Total Bike Rentals')
plt.show()

: 

In [None]:
seasonly_rentals = day.groupby(by = 'season').agg({
  'cnt': 'sum',
  'holiday': 'sum'
}).reset_index()

seasonly_rentals

: 

In [None]:
plt.figure(figsize=(10, 6))

sns.barplot(x='season', y='cnt', data=seasonly_rentals, palette='pastel', hue='season', legend=False)


plt.title('Explore Seasonal Bike Usage Distribution')
plt.xlabel('Season')
plt.ylabel('Total Bike Rentals')
plt.show()

: 

The visualizations show that bike usage peaked in **August** and during the **fall** season, when the number of holidays is lower. This suggests that **the number of holidays does not significantly affect bike usage**, a conclusion supported by the earlier correlation analysis, which showed almost no correlation between holidays and bike count.
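
The relationship can also be expressed as a single number by correlating the holiday count with the total rentals per group. This is a hedged sketch: `holiday_usage_correlation` is a hypothetical helper, and the toy frame is constructed so that holidays and usage are unrelated.

```python
import pandas as pd

def holiday_usage_correlation(day_df, by='mnth'):
    """Pearson correlation between holiday count and total rentals per group (hypothetical helper)."""
    grouped = day_df.groupby(by).agg(cnt=('cnt', 'sum'), holiday=('holiday', 'sum'))
    return grouped['cnt'].corr(grouped['holiday'])

# Toy frame for illustration only: holidays are unrelated to usage by construction
toy_day = pd.DataFrame({
    'mnth':    [1, 1, 2, 2, 3, 3, 4, 4],
    'holiday': [1, 0, 0, 0, 1, 0, 0, 0],
    'cnt':     [100, 100, 300, 100, 200, 200, 100, 100],
})
print(holiday_usage_correlation(toy_day))
```

With only 12 months or 4 seasons as data points, such a coefficient on the real data should be read as a rough indication rather than a test.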

### Question 2: What are the mean values of temperature (temp), feels-like temperature (atemp), humidity (hum), and wind speed in each season or month? How do these factors correlate with the number of daily bike rentals?

Since the data is slightly skewed, we will use the median as the measure of central tendency.
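
A toy example shows why the median is the safer choice for skewed data: a single extreme value drags the mean far from the bulk of the observations, while the median barely moves. The numbers are illustrative only.

```python
import pandas as pd

skewed = pd.Series([1, 2, 3, 4, 100])  # toy values with one extreme observation

# The mean is pulled towards the extreme value; the median stays at the middle observation
print(skewed.mean(), skewed.median())
```

Here the mean is 22 while the median is 3, even though four of the five values are below 5.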

In [None]:
monthly = day.groupby(by = 'mnth').agg({
  'temp': 'median',
  'atemp': 'median',
  'hum': 'median',
  'windspeed': 'median',
  'cnt': 'sum'
}).reset_index()

monthly

: 

In [None]:
columns_to_plot = monthly.drop(columns='mnth').columns

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for i, column in enumerate(columns_to_plot):
    if column != 'cnt':
        sns.scatterplot(x=column, y='cnt', hue='mnth', data=monthly, alpha=0.5, s=500, palette='deep', ax=axes[i])
        axes[i].set_title(f'Scatterplot of Total Bike Count vs {column} with Month as Hue')
        axes[i].set_xlabel(column)
        axes[i].set_ylabel('Total Bike Usage')
        axes[i].grid(True)

plt.tight_layout()
plt.show()

: 

In [None]:
seasonly = day.groupby(by = 'season').agg({
  'temp': 'median',
  'atemp': 'median',
  'hum': 'median',
  'windspeed': 'median',
  'cnt': 'sum'
}).reset_index()

seasonly

: 

In [None]:
columns_to_plot = seasonly.drop(columns='season').columns

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.flatten()

for i, column in enumerate(columns_to_plot):
    if column != 'cnt':
        sns.scatterplot(x=column, y='cnt', hue='season', data=seasonly, alpha=0.5, s=500, palette='deep', ax=axes[i])
        axes[i].set_title(f'Scatterplot of Total Bike Count vs {column} with Season as Hue')
        axes[i].set_xlabel(column)
        axes[i].set_ylabel('Total Bike Usage')
        axes[i].grid(True)

plt.tight_layout()
plt.show()

: 

The plots show a **moderate positive correlation between temp, atemp, and cnt** and a **weak negative correlation between windspeed and cnt**. Humidity, on the other hand, shows almost no correlation with cnt. This agrees with the correlation values from the exploratory data analysis.

## Conclusion

- Bike usage peaked in the mid to late part of the year, notably in August and in the fall season. Surprisingly, holidays did not significantly affect bike usage: the correlation between holidays and the total bike count is close to zero.

- Temperature, both actual (`temp`) and perceived (`atemp`), follows a similar trend and peaks in the mid to late part of the year. Humidity also peaked during this period, while wind speed was highest early in the year. Higher temperatures coincide with higher bike usage, and lower wind speeds also coincide with higher usage. The correlations are moderate at best, and further investigation is needed to determine whether these factors exhibit multicollinearity.
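
One way to follow up on the multicollinearity question is the variance inflation factor (VIF), computed as 1 / (1 - R²) when each predictor is regressed on the others. The sketch below is an assumption-laden illustration, not part of the analysis above: the `vif` helper is written from scratch with NumPy, and the predictors are synthetic, with `b` built as a near copy of `a` to mimic a pair like temp and atemp.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X, computed with plain least squares."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(X)), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r_squared = 1 - residual.var() / target.var()
        factors.append(1 / (1 - r_squared))
    return factors

# Synthetic predictors (not the real data): b is nearly a copy of a, c is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.05, size=200)
c = rng.normal(size=200)
print([round(v, 1) for v in vif(np.column_stack([a, b, c]))])
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. On the real data the call would be `vif(day[['temp', 'atemp', 'hum', 'windspeed']])`.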