# London Bike Sharing: EDA and Visualization

## **About Data**

This notebook uses the Kaggle london-bike-sharing-dataset.


To reduce traffic congestion and air pollution in London, the government encourages people to use shared bicycles. Demand for shared bicycles is influenced by several factors: air temperature, humidity, wind, whether the day is a holiday or a weekend, and the season. We can use this information to draw conclusions about people's bike-sharing habits.


**Dataset Schema** 

**timestamp:** timestamp indicating the date and time of each observation.<br>
**cnt (count):** the count of new bike shares recorded for each timestamp.<br>
**t1:** real temperature in Celsius at the time of observation.<br>
**t2:** perceived temperature in Celsius, known as Real Feel Temperature.<br>
**hum (humidity):** humidity level expressed as a percentage.<br>
**wind_speed:** wind speed in km/h.<br>
**weather_code:** category of the weather, according to this chart:

* **1 -** Clear or mostly clear with possible haze, fog, or patches of fog
* **2 -** Scattered clouds or few clouds
* **3 -** Broken clouds
* **4 -** Cloudy
* **7 -** Rain, light rain shower, or light rain
* **10 -** Rain with thunderstorm 
* **26 -** Snowfall
* **94 -** Freezing fog

**is_holiday:** The observation is a holiday (1) or a non-holiday (0).<br>
**is_weekend:** The observation falls on a weekend (1) or a non-weekend day (0).<br>
**season:** Categories of the column:

* **0 -** Spring
* **1 -** Summer
* **2 -** Fall
* **3 -** Winter



In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")


In [None]:
df = pd.read_csv("london_merged.csv")
df.head()

## a) Let's check the structure of the dataframe

In [None]:
df.shape

In [None]:
df.info()

**CONCLUSION 1:** The dataframe has 10 columns. 'weather_code', 'is_holiday', 'is_weekend' and 'season' are float type; it is advisable to replace their coded values with readable labels so that later results are easier to read. For example, the 'is_weekend' column should show Weekend or Non-weekend instead of 1s and 0s.

**CONCLUSION 2:** The 'timestamp' column is an object, i.e. a string. Given the nature of the values, we should convert it to a datetime type so that we can do time series calculations or extract values such as the day of the week or the day number.
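
The conversion suggested in Conclusion 2 can be sketched on a couple of made-up rows. The `YYYY-MM-DD HH:MM:SS` layout and the sample values are assumptions for illustration, not rows taken from the dataset, and `toy` is a hypothetical name.

```python
import pandas as pd

# Made-up rows; the string layout is an assumption about the CSV
toy = pd.DataFrame({
    "timestamp": ["2015-01-04 00:00:00", "2015-01-04 01:00:00"],
    "cnt": [182, 138],
})

# Parse the strings into datetime64 values; with an explicit format,
# malformed entries raise an error instead of being guessed at
toy["date"] = pd.to_datetime(toy["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Once parsed, the .dt accessor exposes calendar fields
toy["day_name"] = toy["date"].dt.day_name()
toy["hour"] = toy["date"].dt.hour
print(toy.dtypes)
```

Passing an explicit `format` is optional; it mainly makes the expected layout visible to the reader.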

In [None]:
df.columns

In [None]:
df.isnull().sum()

**There are no null values in the dataframe.**

## b) Time to implement some changes on the columns we mentioned before. Each column is reassigned with the result of "replace", using a dictionary that maps codes to labels

In [None]:
# Map the numeric codes to readable labels. Assigning the result back to the
# column avoids chained in-place replacement, which newer pandas versions
# warn about and may not apply to the original DataFrame.

# Season
df['season'] = df['season'].replace({0: 'Spring', 1: 'Summer', 2: 'Fall', 3: 'Winter'})

# Is holiday
df['is_holiday'] = df['is_holiday'].replace({1: 'Holiday', 0: 'Non-holiday'})

# Is weekend
df['is_weekend'] = df['is_weekend'].replace({1: 'Weekend', 0: 'Non-weekend'})

# Weather codes
df['weather_code'] = df['weather_code'].replace({
    1: 'Clear',
    2: 'Scattered clouds',
    3: 'Broken clouds',
    4: 'Cloudy',
    7: 'Rain',
    10: 'Rain with thunderstorm',
    26: 'Snowfall',
    94: 'Freezing fog',
})
df.head()

**Now we add a new column 'date', based on 'timestamp' but with datetime type**

In [None]:
df['date'] = pd.to_datetime(df['timestamp']) 
df.info()


In [None]:

df['day'] = df['date'].dt.day_name()
df['month'] = df['date'].dt.month_name()
df.head(2)

## c) Distribution of share count per day of week, month, weather, season and holiday

### We can estimate the total rentals per day of the week by grouping the rows

In [None]:
# Day

daily_bike_share  = (df.groupby('day')['cnt'].sum()/1000000).round(1)
daily_bike_share

**Observation: Tuesdays, Wednesdays and Thursdays each account for about 3.1 million rentals over the whole dataset**
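
Grouping by day name returns the weekdays in alphabetical order (Friday, Monday, ...), which makes the table and the bar chart harder to read. A minimal sketch of putting them in calendar order with `reindex`; `toy` and `day_order` are hypothetical names and the counts are made up.

```python
import pandas as pd

# Toy data standing in for the notebook's dataframe (values are made up)
toy = pd.DataFrame({
    "day": ["Monday", "Tuesday", "Sunday", "Monday", "Friday"],
    "cnt": [10, 20, 5, 30, 15],
})

day_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# groupby sorts the index alphabetically; reindex restores calendar order
# and fills weekdays that do not occur in the toy data with 0
totals = toy.groupby("day")["cnt"].sum().reindex(day_order, fill_value=0)
print(totals)
```

The same `reindex` call could be applied to `daily_bike_share` before plotting, assuming all seven day names are present.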

In [None]:
plt.figure(figsize=(12,5))
ax=sns.barplot(x=daily_bike_share.index, y=daily_bike_share.values, errorbar=None)
ax.set_title("Sum of renting per day (Million)")
ax.set_xlabel('') # To cancel writing "day" on the x-axis 
ax.set_ylabel('In Million')
for i in ax.containers:
    ax.bar_label(i)

### We can estimate the share counts per month and express the figures in millions

In [None]:
# Month

monthly_bike_share  = (df.groupby('month')['cnt'].sum()/1000000).round(1)
monthly_bike_share

In [None]:
# Month

plt.figure(figsize=(12,5))

ax=sns.barplot(x=monthly_bike_share.index, y=monthly_bike_share.values, errorbar=None)
ax.set_title("Monthly Bike Sharing (Million)")
ax.set_xlabel('Month')  
ax.set_ylabel('In Million')
for i in ax.containers:
    ax.bar_label(i)

### Distribution per  season

In [None]:
#Season 

seasonally_bike_share  = (df.groupby('season')['cnt'].sum()/1000000).round(1)
seasonally_bike_share

In [None]:
#Season 

plt.figure(figsize=(12,5))

ax=sns.barplot(x=seasonally_bike_share.index, y=seasonally_bike_share.values, errorbar=None)
ax.set_title("Seasonally Bike Sharing (Million)")
ax.set_xlabel('Season')  

for i in ax.containers:
    ax.bar_label(i)

### Distribution per weather condition

In [None]:
# Weather 

weather_bike_share  = (df.groupby('weather_code')['cnt'].sum()/1000000).round(1)
weather_bike_share

In [None]:
plt.figure(figsize=(12,5))

ax=sns.barplot(x=weather_bike_share.index, y=weather_bike_share.values, errorbar=None)
ax.set_title("Bike Sharing at Different Weather Conditions (Million)")
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
ax.set_xlabel('weather')  

for i in ax.containers:
    ax.bar_label(i)

### Distribution per  Holiday value

In [None]:
# Holiday 

holiday_bike_share  = (df.groupby("is_holiday")['cnt'].sum()/1000000).round(1)
holiday_bike_share

In [None]:
# Holiday 

plt.figure(figsize=(12,5))

ax=sns.barplot(x=holiday_bike_share.index, y=holiday_bike_share.values, errorbar=None)
ax.set_title("Bike Sharing at Holidays (Million)")
ax.set_xlabel('Holiday Value')  

for i in ax.containers:
    ax.bar_label(i)

### Distribution per  weekends/weekdays

In [None]:
# Weekend 

weekend_bike_share  = (df.groupby('is_weekend')['cnt'].sum()/1000000).round(1)
weekend_bike_share

In [None]:
# Weekend

plt.figure(figsize=(12,5))

ax=sns.barplot(x=weekend_bike_share.index, y=weekend_bike_share.values, errorbar=None)
ax.set_title("Bike Sharing at Weekends (Million)")
ax.set_xlabel('is weekend')  

for i in ax.containers:
    ax.bar_label(i)

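* Total bike sharing is higher on non-weekend days. Since there are more weekdays than weekend days, part of this gap comes from the number of days rather than from demand per day.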

## Second way of drawing figures

In [None]:
# 2. Way:

fig, ax = plt.subplots(2,2, figsize=(12, 12))


# first plot by season (mean cnt per season)
sns.barplot(data=df, x="season", y="cnt", ax=ax[0][0], errorbar=None)
ax[0][0].set_xlabel('')
ax[0][0].set_ylabel('Count')
ax[0][0].set_title('Distribution of Season')

# second plot by holiday (number of observations)
sns.countplot(data=df, x="is_holiday", ax=ax[0][1])
ax[0][1].set_xlabel('')
ax[0][1].set_ylabel('Count')
ax[0][1].set_title('Distribution of is_holiday')

# third plot by weekend (number of observations)
sns.countplot(data=df, x="is_weekend", ax=ax[1][0])
ax[1][0].set_xlabel('')
ax[1][0].set_ylabel('Count')
ax[1][0].set_title('Distribution of is_weekend')

# fourth plot by weather code (number of observations)
sns.countplot(data=df, x="weather_code", ax=ax[1][1])
ax[1][1].set_xlabel('')
ax[1][1].set_ylabel('Count')
ax[1][1].set_title('Distribution of weather_code')
ax[1][1].set_xticklabels(ax[1][1].get_xticklabels(), rotation=45, ha='right')

fig.tight_layout()
plt.show()

**Insight:** People rent bicycles most often when the weather is "clear". In second place is "scattered clouds" and in third place is "broken clouds". Interestingly, more bicycles are rented in "rain" than in "cloudy" conditions. In snowy and stormy weather, bicycle hire is close to zero.
* Demand for bicycle hire is higher on working days and in clear weather.
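
Note that `countplot` shows how many observations fall into each category, which is not the same as how many bikes were rented. A minimal sketch with made-up numbers shows how the observation count, the total and the mean can rank categories differently; `toy` and `summary` are hypothetical names.

```python
import pandas as pd

# Made-up observations: three clear-weather rows, one rainy, one cloudy
toy = pd.DataFrame({
    "weather": ["Clear", "Clear", "Clear", "Rain", "Cloudy"],
    "cnt": [100, 200, 300, 900, 50],
})

# count = number of observations, sum = total rentals,
# mean = average rentals per observation
summary = toy.groupby("weather")["cnt"].agg(["count", "sum", "mean"])
print(summary)
```

In this toy frame "Clear" has the most observations, while "Rain" has the highest total and mean, so it helps to state which of the three a chart is showing.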

## Boxenplots for outlier analysis

In [None]:
plt.figure(figsize=(20,3))
plt.title('Number of Counts')
sns.boxenplot(x=df["cnt"], color='g');

* 5 outliers spotted (cnt) on the higher side.

In [None]:
plt.figure(figsize=(20,3))
plt.title('Humidity')
sns.boxenplot(x=df["hum"], color='g');

* 4 outliers spotted in humidity on the low side.

In [None]:
plt.figure(figsize=(20,3))
plt.title('Wind Speed')
sns.boxenplot(x=df["wind_speed"], color='g');

* 4 outliers spotted in wind speed on the higher side.
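
Counting outliers by eye from a plot is error-prone, so a numeric check can help. The sketch below uses the common 1.5 × IQR rule, which is a different technique from the letter-value method behind `boxenplot`, so its counts need not match the ones read off the plots above. `iqr_outliers` is a hypothetical helper and the series is made up.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    mask = (values < q1 - k * iqr) | (values > q3 + k * iqr)
    return values[mask]

# Made-up values with one obvious high outlier
toy = pd.Series([10, 12, 11, 13, 12, 11, 95])
print(iqr_outliers(toy))
```

On the notebook's data this could be called as `iqr_outliers(df["cnt"])`, `iqr_outliers(df["hum"])` and so on.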

# Let's look at the data type of each variable, convert timestamp to datetime type, and set it as the index.

In [None]:
# Read the dataset again

df=pd.read_csv("london_merged.csv")
df.sample(2)

In [None]:
df.info()

In [None]:
# Let's convert the variable "timestamp" from object format to "datetime" format and store 
# it in a new column;

df['date'] = pd.to_datetime(df['timestamp'])

In [None]:
df['date'].dtype

In [None]:
# I want to use the timestamp as the index.
# With a datetime index we can call methods such as strftime directly on df.index to extract date information.

df = df.set_index(df['date']) 
df.head(2)

# Feature engineering: extract new columns (day of the week, day of the month, hour, month, season, year, etc.)



https://www.programiz.com/python-programming/datetime/strftime

**Directive Meaning Example**

* **%a** Abbreviated weekday name. Sun, Mon, ...
* **%A** Full weekday name. Sunday, Monday, ...
* **%w** Weekday as a decimal number. 0, 1, ..., 6
* **%d** Day of the month as a zero-padded decimal. 01, 02, ..., 31
* **%-d** Day of the month as a decimal number. 1, 2, ..., 31
* **%b** Abbreviated month name. Jan, Feb, ..., Dec
* **%B** Full month name. January, February, ...
* **%m** Month as a zero-padded decimal number. 01, 02, ..., 12
* **%-m** Month as a decimal number. 1, 2, ..., 12
* **%y** Year without century as a zero-padded decimal number. 00, 01, ..., 99
* **%-y** Year without century as a decimal number. 0, 1, ..., 99
* **%Y** Year with century as a decimal number. 2013, 2019 etc.
* **%H** Hour (24-hour clock) as a zero-padded decimal number. 00, 01, ..., 23
* **%-H** Hour (24-hour clock) as a decimal number. 0, 1, ..., 23
* **%I** Hour (12-hour clock) as a zero-padded decimal number. 01, 02, ..., 12
* **%-I** Hour (12-hour clock) as a decimal number. 1, 2, ... 12
* **%p** Locale’s AM or PM. AM, PM
* **%M** Minute as a zero-padded decimal number. 00, 01, ..., 59
* **%-M** Minute as a decimal number. 0, 1, ..., 59
* **%S** Second as a zero-padded decimal number. 00, 01, ..., 59
* **%-S** Second as a decimal number. 0, 1, ..., 59
* **%f** Microsecond as a decimal number, zero-padded on the left. 000000 - 999999
* **%z** UTC offset in the form +HHMM or -HHMM.
* **%Z** Time zone name.
* **%j** Day of the year as a zero-padded decimal number. 001, 002, ..., 366
* **%-j** Day of the year as a decimal number. 1, 2, ..., 366
* **%U** Week number of the year (Sunday as the first day of the week). All days in a new year preceding the first Sunday are considered to be in week 0. 00, 01, ..., 53
* **%W** Week number of the year (Monday as the first day of the week). All days in a new year preceding the first Monday are considered to be in week 0. 00, 01, ..., 53
* **%c** Locale’s appropriate date and time representation. Mon Sep 30 07:06:05 2013
* **%x** Locale’s appropriate date representation. 09/30/13
* **%X** Locale’s appropriate time representation. 07:06:05
* **%%** A literal '%' character. %
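
A few of these directives can be tried on a fixed moment with the standard library. The date below is the one used in the table's own examples; `moment` is a hypothetical name.

```python
from datetime import datetime

# Monday, 30 September 2013, 07:06:05
moment = datetime(2013, 9, 30, 7, 6, 5)

print(moment.strftime("%Y-%m"))  # year and month together
print(moment.strftime("%d"))     # zero-padded day of the month
print(moment.strftime("%H"))     # zero-padded hour, 24-hour clock
print(moment.strftime("%w"))     # weekday as a number, Sunday = 0
print(moment.strftime("%j"))     # zero-padded day of the year
```

The unpadded variants such as `%-d` rely on the platform's C library and are not available everywhere (for example on Windows), so they are left out of this sketch.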


In [None]:
# Let's take years of "datetime" data.
df["year"] = df.index.strftime('%Y')
df["year"]

In [None]:
# Day of the month as a zero-padded string

df["day"] = df.index.strftime("%d")
print(df["day"])


In [None]:
# Day names;

df['day_name'] = df.index.day_name()
df['day_name'] 

In [None]:
# Month number as a zero-padded string

df["month"] = df.index.strftime("%m")  # 2nd way: df['month'] = df.index.month
df["month"]

In [None]:
# hour 1. WAY;  

df["hour"] = df.index.strftime("%H")
df["hour"]

In [None]:
# 2. WAY; Hour; 

df['hour2'] = df.index.hour
df['hour2'] 
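
The two ways are not interchangeable: `strftime` produces strings, while the `.hour` attribute produces integers. A small sketch on a made-up index (`idx` and the two result names are hypothetical):

```python
import pandas as pd

# Two made-up timestamps
idx = pd.DatetimeIndex(["2015-01-04 07:00:00", "2015-01-04 18:00:00"])

hour_as_text = idx.strftime("%H")  # strings: '07', '18'
hour_as_int = idx.hour             # integers: 7, 18

print(list(hour_as_text), list(hour_as_int))
```

This matters later: string columns are skipped by `select_dtypes("number")`, so only the integer version can take part in a correlation matrix.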

# Visualize the correlation with a heatmap

In [None]:
plt.figure(figsize=(10,6))
sns.heatmap(df.select_dtypes("number").corr(), cmap="YlGnBu", annot=True);

* There is a very high correlation (0.99) between the temperature (t1) and the perceived temperature (t2), so it would be appropriate to include only one of them in any ML analysis.
* There is a positive, moderate correlation (0.39) between the temperature and the number of bicycles rented.
* There is a negative, moderate correlation (-0.45) between the temperature (t1) and humidity.
* There is a negative, weaker correlation (-0.29) between wind speed and humidity.
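
The values in the heatmap are Pearson correlation coefficients, the default of `DataFrame.corr`. As a minimal sketch, the coefficient can be computed by hand and compared with pandas on made-up numbers; `toy`, `r_manual` and `r_pandas` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up temperature / rental pairs
toy = pd.DataFrame({
    "t1": [3.0, 8.0, 12.0, 18.0, 25.0],
    "cnt": [400, 700, 900, 1500, 2100],
})

# Pearson r: sum of products of the centred values, divided by the
# square root of the product of their sums of squares
x = toy["t1"] - toy["t1"].mean()
y = toy["cnt"] - toy["cnt"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = toy["t1"].corr(toy["cnt"])
print(round(r_manual, 4), round(r_pandas, 4))
```

Pearson's r only measures linear association, so a curved relationship can yield a small coefficient even when the variables are clearly related.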

# Visualize the correlation of the target variable and the other features with barplot

In [None]:
# Corr():

df_corr_cnt = df.select_dtypes("number").corr()[["cnt"]].sort_values(by="cnt", ascending=False) # 2nd way: df.corr(numeric_only=True)["cnt"]
df_corr_cnt

In [None]:
plt.figure(figsize=(2,7))
sns.heatmap(df_corr_cnt, annot=True, vmin=-1, vmax=1);

* There is a positive correlation between bike sharing and temperature and a negative correlation between bike sharing and humidity. The highest correlation in absolute value is with humidity (-0.46).

In [None]:
# Relation with the target variable (cnt) and other variables by barplot

target_corr = df.select_dtypes("number").corr().drop(["cnt"])
target_corr

In [None]:
target = target_corr["cnt"].sort_values(ascending=False)
target

In [None]:
plt.figure(figsize=(14, 8))
sns.barplot(x=target.index, y=target.values)
plt.xlabel('Features')
plt.ylabel('Correlation')
plt.title('Correlation between cnt and Features')
plt.xticks(rotation=45) # in the object-oriented API we use ax.tick_params(axis="x", rotation=45) instead
plt.show()

* There is a positive correlation between the number of bike shares and temperature, hour, wind speed and season. On the other hand, there is a negative correlation between bike sharing and holidays, weekends, weather code and humidity.

# Plot bike shares over time using lineplot.

In [None]:
# Relation between the target variable (cnt) and temperature (t1) with lineplot

plt.figure(figsize=(12,6))
sns.lineplot(data=df, x="t1", y="cnt")
plt.title("Relation between Temperature and Number of Rentals")
plt.xlabel("Temperature")

plt.show()

* The number of bicycle rentals increases with temperature, and above 33 °C there is a steep increase in rentals.

# Relation between the target variable (cnt) and humidity (hum) with lineplot

In [None]:
plt.figure(figsize=(12,6))
sns.lineplot(data=df, x="hum", y="cnt")
plt.title("Relation Humidity Vs Number of Rentals")
plt.xlabel("Humidity")

plt.show()

* There is a negative relation between humidity and number of rentals.


# Target variable (cnt) vs wind speed (wind_speed) using lineplot

In [None]:
plt.figure(figsize=(12,6))
sns.lineplot(data=df, x="wind_speed", y="cnt")
plt.title("Relation between Wind Speed and Number of Bicycle Rentals")
plt.xlabel("Wind Speed")

plt.show()

* There is a concave, roughly parabolic relationship between wind speed and bicycle hire.
* People prefer to rent bicycles in lightly windy weather, but they give up renting when the wind speed exceeds 26 km/h.
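
One way to check a "concave parabola" reading numerically is to fit a second-degree polynomial and look at the sign of the leading coefficient. The sketch below does this on synthetic points generated from a known parabola, so the coefficients and the peak are illustrative only and say nothing about the real dataset; `wind` and `rentals` are hypothetical arrays.

```python
import numpy as np

# Synthetic data from a known concave parabola (not the notebook's data)
wind = np.arange(0, 41, 5, dtype=float)
rentals = -2.0 * wind**2 + 60.0 * wind + 800.0

# Least-squares fit of rentals = a*wind^2 + b*wind + c
a, b, c = np.polyfit(wind, rentals, deg=2)

# a < 0 means the fitted curve is concave; its maximum sits at -b / (2a)
peak = -b / (2 * a)
print(f"a={a:.2f}, peak at wind speed {peak:.1f}")
```

On real data one would fit against `df["wind_speed"]` and `df["cnt"]`; the fit is noisy there, so the sign of `a` is more informative than the exact peak.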

# Plot bike shares by months and year_month (use lineplot, pointplot, barplot).

In [None]:
df=pd.read_csv("london_merged.csv")

In [None]:
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [None]:
df = df.set_index('timestamp') 

In [None]:
df["year_month"] = df.index.strftime('%Y-%m') # year and month together
df["year_month"]

In [None]:
#  Bike shares by months with lineplot 

plt.figure(figsize=(11,6))

sns.lineplot(data=df, x="year_month", y="cnt")
plt.title("Bike Shares by Months")
plt.xlabel("Months")
#plt.set_xticklabels(chart.get_xticklabels(), rotation=45)
plt.xticks(rotation=45)

plt.show()

* Demand for bicycle hire increases rapidly from March onwards, peaks in July and then starts to decline. The decline accelerates after October, and demand reaches its lowest levels between December and March. The two years show similar values at similar times of the year.
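
The line plot above shows the mean of `cnt` per `year_month`. If monthly totals are wanted instead, one option is to group by a monthly period derived from the datetime index. This is a sketch on made-up daily values, with `toy` and `monthly` as hypothetical names.

```python
import pandas as pd

# Four made-up days spanning a month boundary
idx = pd.to_datetime(["2015-01-30", "2015-01-31", "2015-02-01", "2015-02-02"])
toy = pd.DataFrame({"cnt": [10, 20, 30, 40]}, index=idx)

# to_period("M") turns each timestamp into its calendar month
monthly = toy.groupby(toy.index.to_period("M"))["cnt"].sum()
print(monthly)
```

Unlike plain strings, the period labels keep their chronological meaning, which can be convenient for later time-based operations.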

In [None]:
df["month"] = df.index.strftime('%m') # month number
df["month"]

In [None]:
# Monthly averages, split by weekend, holiday and season

fig, ax = plt.subplots(3,1, figsize=(12,10))

sns.lineplot(data=df,  x = "month", y="cnt", ax =ax[0], hue = "is_weekend")
sns.pointplot(data=df, x = "month", y="cnt", ax =ax[1], hue = "is_holiday", errorbar=None)
sns.barplot(data=df,   x = "month", y="cnt", ax =ax[2], hue = "season", errorbar=None)

plt.show()

* Demand for bicycle hire decreases towards the end of the year.

In [None]:
df["hour"] = df.index.strftime('%H') # we are obtaining hour here.
df["hour"]

In [None]:
fig, ax = plt.subplots(3,1,figsize=(12,11))

sns.lineplot(data=df,  x="hour", y="cnt", hue="is_holiday", ax=ax[0])
sns.pointplot(data=df, x="hour", y="cnt", hue="is_weekend", ax=ax[1], errorbar=None)
sns.barplot(data=df,   x="hour", y="cnt", hue="season",     ax=ax[2], errorbar=None)

plt.show()

# Plot bike shares by day of week. You may want to see whether it is a holiday or not

In [None]:
df["day_of_week"] = df.index.strftime('%w') # day of week as a number (0 = Sunday, ..., 6 = Saturday)
df["day_of_week"]

In [None]:
fig, ax = plt.subplots(3,1,figsize=(12,11))

sns.lineplot(data=df, x="day_of_week", y="cnt", hue="is_holiday", ax=ax[0])
sns.pointplot(data=df, x="day_of_week", y="cnt", hue="is_weekend", ax=ax[1], errorbar=None)
sns.barplot(data=df, x="day_of_week", y="cnt", hue="season", ax=ax[2], errorbar=None)

plt.show()

* Bicycle rental is higher on weekdays during non-holiday periods. It is higher in summer.

# Plot bike shares by day of month

In [None]:
df["day"] = df.index.strftime('%d') # day of the month
df["day"]

In [None]:
fig, ax = plt.subplots(3,1,figsize=(12,11))

sns.lineplot(data=df, x="day", y="cnt", hue="is_holiday", ax=ax[0])
sns.pointplot(data=df, x="day", y="cnt", hue="is_weekend", ax=ax[1], errorbar=None)
sns.barplot(data=df, x="day", y="cnt", hue="season", ax=ax[2], errorbar=None)

plt.show()

* Average demand per observation generally fluctuates between 1,000 and 1,300 across the days of the month. It decreases towards the end of the month and at weekends.
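
The plots above average `cnt` over individual observations, which is different from a total per calendar day. Assuming the observations are hourly (an assumption suggested by the hour column extracted earlier), the sketch below contrasts the two on made-up values; `hourly`, `daily_total` and `mean_per_obs` are hypothetical names.

```python
import pandas as pd

# Four made-up hourly observations spanning midnight
idx = pd.to_datetime([
    "2015-01-04 22:00", "2015-01-04 23:00",
    "2015-01-05 00:00", "2015-01-05 01:00",
])
hourly = pd.Series([100, 50, 30, 20], index=idx, name="cnt")

# Total rentals per calendar day
daily_total = hourly.resample("D").sum()

# Mean rentals per observation, grouped by day of the month
mean_per_obs = hourly.groupby(hourly.index.day).mean()

print(daily_total)
print(mean_per_obs)
```

The two numbers answer different questions, so it is worth stating which one a figure such as "1,000 to 1,300" refers to.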

# Plot bike shares by year (Plot bike shares on holidays by seasons)

In [None]:
df["year"] = df.index.strftime('%Y') # we are obtaining year here.
df["year"]

In [None]:
fig, ax = plt.subplots(3,1,figsize=(12,11))

sns.lineplot(data=df,  x="year", y="cnt", hue="is_holiday", ax=ax[0])
sns.pointplot(data=df, x="year", y="cnt", hue="is_weekend", ax=ax[1], errorbar=None)
sns.barplot(data=df,   x="year", y="cnt", hue="season",     ax=ax[2], errorbar=None)

plt.show()

* According to these graphs, the demand for bicycle hire increased slightly from 2015 to 2016.
* The reason for the drop in 2017 is that the dataset contains only January data for that year.
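
A quick way to check how much of each year a dataset covers is to count the distinct months per year. The sketch below uses a handful of made-up dates rather than the real index; `idx` and `coverage` are hypothetical names.

```python
import pandas as pd

# Made-up dates: two months in 2015, two in 2016, one in 2017
idx = pd.to_datetime([
    "2015-03-01", "2015-07-15", "2016-02-10", "2016-11-30", "2017-01-02",
])

# Month numbers labelled by year, then distinct months per year
coverage = pd.Series(idx.month, index=idx.year).groupby(level=0).nunique()
print(coverage)
```

Applied to the notebook's datetime index, a year with far fewer months than the others would flag a partial year before yearly averages are compared.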

# Visualize the distribution of bike shares by weekday/weekend with piechart and barplot

In [None]:
df["is_weekend"].value_counts()

In [None]:
df_isweekend = df.groupby("is_weekend")["cnt"].mean()
df_isweekend

In [None]:
df_isweekend = round(df_isweekend,0)

In [None]:
# Bike shares by weekday/weekend with a bar chart

# plt.bar(df_isweekend.index, df_isweekend.values)  # Matplotlib

# sns.barplot(x=df_isweekend.index, y=df_isweekend.values)  # Seaborn

# Seaborn with an Axes object, so the values can be written on top of the bars
ax = sns.barplot(x=df_isweekend.index, y=df_isweekend.values, errorbar=None)
for i in ax.containers:
    ax.bar_label(i)

In [None]:
#  Bike shares by weekday/weekend with piechart by using Matplotlib

plt.figure(figsize=(8,6))

plt.pie( df_isweekend.values, labels= df_isweekend.index, autopct="%1.1f%%")
plt.title("Bike Shares by Weekday/Weekend")
plt.xlabel("Weekday/Weekend")

plt.show()

* Demand for bicycle hire is lower at weekends.

# Plot the distribution of weather code by seasons

In [None]:
colors_of_seasons = ["lightgreen", "red", "yellow", "lightblue"]
plt.figure(figsize=(12,7))
sns.barplot(data=df, x="weather_code", y="cnt", hue="season", palette = colors_of_seasons, errorbar=None) # errorbar=None suppresses the confidence-interval bars on top of the bars.
plt.title("Bike Sharing on Weather Code  by Seasons")
plt.show()

* Demand for bicycle hire is higher in summer.

In [None]:
df_season = df.groupby("season")["cnt"].mean()
df_season

In [None]:
plt.figure(figsize=(10,7))

plt.pie( df_season.values, labels= df_season.index, autopct="%1.1f%%")
plt.title("Bike Shares by Seasons")
#plt.xlabel("Weekday/Weekend")

plt.show()

* Based on the mean count per season (0 = spring, 1 = summer, 2 = fall, 3 = winter), summer accounts for about 32% of the total, autumn for 25.8%, spring for 24.2% and winter for the remaining 18%.
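
The percentages in a pie chart built from `df_season` are each season's share of the summed seasonal means. A minimal sketch with made-up means (not values from the dataset) that computes the shares and attaches season names following the coding in the schema; `season_names`, `means` and `shares` are hypothetical names.

```python
import pandas as pd

# Season coding as given in the dataset schema
season_names = {0: "Spring", 1: "Summer", 2: "Fall", 3: "Winter"}

# Made-up mean counts per season code
means = pd.Series([1000.0, 1500.0, 1200.0, 800.0], index=[0, 1, 2, 3])

# Share of each season in percent, with readable labels
shares = (means / means.sum() * 100).round(1).rename(index=season_names)
print(shares)
```

Passing `shares.index` as the pie labels would avoid having to decode 0 to 3 when reading the chart.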

In [None]:
#For weather_condition;
plt.figure(figsize=(9,6))

sns.countplot(data=df, x=df["weather_code"])
plt.title("The Effect of Weather on the Number of Bicycle Rentals")
plt.xlabel("Weather Condition")

plt.show()

In [None]:
# For holidays

plt.figure(figsize=(6,4))

sns.countplot(data=df, x=df["is_holiday"]);
plt.title("The Effect of Holidays on the Number of Bicycle Rentals")
plt.xlabel("Holiday or not")

plt.show()

In [None]:
df["new_feature"] = df.groupby(["is_holiday"])["cnt"].value_counts().mean()
df["new_feature"] = df["new_feature"].round()
df["new_feature"]

In [None]:
plt.figure(figsize=(6,4))

sns.countplot(data=df, x=df["new_feature"], hue="season" );
plt.title("The Effect of Holidays on the Number of Bicycle Rentals")
plt.xlabel("Holiday or not")

plt.show()

In [None]:
plt.figure(figsize=(10,7))
plt.title('Bike Number')
sns.stripplot(data=df, x="weather_code", y="cnt");

In [None]:
plt.figure(figsize=(10,7))
plt.title('Bike Number')
sns.stripplot(data=df, x="season", y="cnt");

In [None]:
plt.figure(figsize=(20,7))
sns.residplot( data=df, x="hum", y="cnt", scatter_kws={"s": 80});

In [None]:
plt.figure(figsize=(20,7))
sns.residplot( data=df, x="wind_speed", y="cnt", scatter_kws={"s": 80});

# THE END