### Group Project - London Bike Rentals

In this project, you will work with the London Bikes dataset, which records daily bike rentals in the city along with key variables such as dates, weather conditions, and seasonality.

The goal is to apply the full data analytics workflow:

- Clean and prepare the dataset.

- Explore the data through visualisation.

- Construct and interpret confidence intervals.

- Build a regression model to explain variation in bike rentals.

By the end, you will connect statistical concepts with practical Python analysis.

In [2]:
## Import libraries and data
## linear regression
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
import statsmodels.formula.api as smf

## hypothesis testing (seaborn is already imported above)
from scipy import stats

**1. Data Cleaning**

Check for missing values across columns. How would you handle them?

Inspect the date column and ensure it is correctly formatted as datetime. Extract useful features (year, month, day, day of week, season).

Convert categorical variables (e.g., season, weather) to appropriate categories in Python.

Ensure numeric columns (e.g., bikes rented, temperature) are in the right format.

In [21]:
## Your code goes here

## read ../Data/london_bikes.csv into a dataframe named london_bikes
london_bikes = pd.read_csv('../Data/london_bikes.csv')


In [None]:
london_bikes

Unnamed: 0,date,bikes_hired,year,wday,month,week,cloud_cover,humidity,pressure,radiation,precipitation,snow_depth,sunshine,mean_temp,min_temp,max_temp,weekend,day,season
0,2010-07-30 00:00:00+00:00,6897,2010,Fri,Jul,30,6.0,65.0,10147.0,157.0,22.0,,31.0,17.7,12.3,25.1,False,30,Summer
1,2010-07-31 00:00:00+00:00,5564,2010,Sat,Jul,30,5.0,70.0,10116.0,184.0,0.0,,47.0,21.1,17.0,23.9,True,31,Summer
2,2010-08-01 00:00:00+00:00,4303,2010,Sun,Aug,30,7.0,63.0,10132.0,89.0,0.0,,3.0,19.3,14.6,23.4,True,1,Summer
3,2010-08-02 00:00:00+00:00,6642,2010,Mon,Aug,31,7.0,59.0,10168.0,134.0,0.0,,20.0,19.5,15.6,23.6,False,2,Summer
4,2010-08-03 00:00:00+00:00,7966,2010,Tue,Aug,31,5.0,66.0,10157.0,169.0,0.0,,39.0,17.9,12.1,20.1,False,3,Summer
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4929,2024-01-27 00:00:00+00:00,16959,2024,Sat,Jan,4,4.0,,10331.0,39.0,0.0,0.0,21.0,4.5,,12.2,True,27,Winter
4930,2024-01-28 00:00:00+00:00,15540,2024,Sun,Jan,4,3.0,,10230.0,63.0,0.0,0.0,59.0,6.6,,12.5,True,28,Winter
4931,2024-01-29 00:00:00+00:00,22839,2024,Mon,Jan,5,8.0,,10222.0,18.0,0.0,0.0,0.0,8.8,,8.8,False,29,Winter
4932,2024-01-30 00:00:00+00:00,22303,2024,Tue,Jan,5,8.0,,10277.0,19.0,0.0,0.0,0.0,8.3,,12.0,False,30,Winter


In [22]:
# show columns with missing values
london_bikes.isnull().sum()
# fill missing values in cloud_cover, humidity, pressure, radiation and snow_depth with 0, saving the result as a separate dataframe
london_bikes_filled = london_bikes.fillna({'cloud_cover': 0, 'humidity': 0, 'pressure': 0, 'radiation': 0, 'snow_depth': 0})
# check if there are still missing values
london_bikes_filled.isnull().sum()

date              0
bikes_hired       0
year              0
wday              0
month             0
week              0
cloud_cover       0
humidity          0
pressure          0
radiation         0
precipitation    31
snow_depth        0
sunshine         31
mean_temp        31
min_temp         62
max_temp         31
weekend           0
dtype: int64
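Filling with 0 treats a missing reading as a real zero, which may be reasonable for a variable like snow depth but less so for humidity or pressure. The block below is a minimal sketch of one alternative on a small made-up dataframe: count the gaps first, then choose a fill rule per column. The `demo` values and the decision to interpolate humidity linearly are illustrative assumptions, not properties of the London Bikes data.

```python
import numpy as np
import pandas as pd

# Tiny made-up example (assumption: values are illustrative, not from london_bikes.csv)
demo = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "humidity": [80.0, np.nan, 76.0, np.nan, np.nan, 70.0],
    "snow_depth": [np.nan, 0.0, np.nan, 1.0, np.nan, 0.0],
})

# Count and share of missing values per column
missing_counts = demo.isnull().sum()
missing_share = demo.isnull().mean()
print(pd.DataFrame({"n_missing": missing_counts, "share": missing_share}))

cleaned = demo.copy()
# Assumption: a missing snow depth means "no snow recorded", so 0 is a plausible fill
cleaned["snow_depth"] = cleaned["snow_depth"].fillna(0)
# Assumption: humidity changes smoothly from day to day, so interpolate between neighbours
cleaned["humidity"] = cleaned["humidity"].interpolate(method="linear")

print(cleaned)
```

Whichever rule is chosen, it helps to keep the original dataframe alongside the filled one, as done above with `london_bikes` and `london_bikes_filled`, so the effect of the fill can be checked later.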

In [None]:
# Convert date columns to datetime format
london_bikes_filled['date'] = pd.to_datetime(london_bikes_filled['date'])
london_bikes['date'] = pd.to_datetime(london_bikes['date'])

# Extract month from date columns (string format)
london_bikes_filled['month'] = london_bikes_filled['date'].dt.strftime('%B').str[:3]
london_bikes['month'] = london_bikes['date'].dt.strftime('%B').str[:3]

# Extract year from date columns
london_bikes_filled['year'] = london_bikes_filled['date'].dt.year
london_bikes['year'] = london_bikes['date'].dt.year

# Extract day from date columns
london_bikes_filled['day'] = london_bikes_filled['date'].dt.day
london_bikes['day'] = london_bikes['date'].dt.day

# Create season columns based on month mapping
season_mapping = {
    'Dec': 'Winter', 'Jan': 'Winter', 'Feb': 'Winter',
    'Mar': 'Spring', 'Apr': 'Spring', 'May': 'Spring',
    'Jun': 'Summer', 'Jul': 'Summer', 'Aug': 'Summer',
    'Sep': 'Fall', 'Oct': 'Fall', 'Nov': 'Fall'
}
london_bikes_filled['season'] = london_bikes_filled['month'].map(season_mapping)
london_bikes['season'] = london_bikes['month'].map(season_mapping)

# convert season to categorical type
london_bikes_filled['season'] = pd.Categorical(london_bikes_filled['season'], categories=['Winter', 'Spring', 'Summer', 'Fall'], ordered=True)
london_bikes['season'] = pd.Categorical(london_bikes['season'], categories=['Winter', 'Spring', 'Summer', 'Fall'], ordered=True)

# convert month to categorical type
london_bikes_filled['month'] = pd.Categorical(london_bikes_filled['month'], categories=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], ordered=True)
london_bikes['month'] = pd.Categorical(london_bikes['month'], categories=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], ordered=True)

# convert day to categorical type
london_bikes_filled['day'] = pd.Categorical(london_bikes_filled['day'], categories=list(range(1, 32)), ordered=True)
london_bikes['day'] = pd.Categorical(london_bikes['day'], categories=list(range(1, 32)), ordered=True)

# convert year to categorical type, ranging from 2010 to 2025
london_bikes_filled['year'] = pd.Categorical(london_bikes_filled['year'], categories=list(range(2010, 2026)), ordered=True)
london_bikes['year'] = pd.Categorical(london_bikes['year'], categories=list(range(2010, 2026)), ordered=True)

# convert week to categorical type
london_bikes_filled['week'] = pd.Categorical(london_bikes_filled['week'], categories=list(range(1, 54)), ordered=True)
london_bikes['week'] = pd.Categorical(london_bikes['week'], categories=list(range(1, 54)), ordered=True)

# convert wday to categorical type
london_bikes_filled['wday'] = pd.Categorical(london_bikes_filled['wday'], categories=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True)
london_bikes['wday'] = pd.Categorical(london_bikes['wday'], categories=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'], ordered=True)


# check if both dataframes are equal except for the missing values, highlight the differences
london_bikes.compare(london_bikes_filled)

london_bikes_filled.dtypes
london_bikes.dtypes


date             datetime64[ns, UTC]
bikes_hired                    int64
year                        category
wday                        category
month                       category
week                        category
cloud_cover                  float64
humidity                     float64
pressure                     float64
radiation                    float64
precipitation                float64
snow_depth                   float64
sunshine                     float64
mean_temp                    float64
min_temp                     float64
max_temp                     float64
weekend                         bool
day                         category
season                      category
dtype: object

Summary 1.

1. Imported the necessary libraries.
2. Created a new dataframe (london_bikes_filled) in which missing values in cloud_cover, humidity, pressure, radiation and snow_depth are filled with 0. The precipitation, sunshine and temperature columns still contain missing values.
3. Checked that the dates are correctly parsed as datetime (they are).
4. Added season and day of month.
5. Converted year, month, week, day and weekday to categorical types.
6. Ensured numeric columns have numeric types.

**2. Exploratory Data Analysis (EDA)**

Plot the distribution of bikes rented.

Explore how rentals vary by season and month.

Investigate the relationship between temperature and bikes rented.

**Deliverables:**

At least 3 clear visualisations with captions.

A short written interpretation of key patterns (seasonality, weather effects, etc.).
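One possible starting point for the three plots is sketched below. It runs on a synthetic stand-in dataframe called `demo` (a hypothetical name) so that it is self-contained; the column names `bikes_hired`, `mean_temp` and `season` are assumed to match the real data, and the numbers themselves are invented. It uses plain matplotlib, though seaborn would work just as well.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# Synthetic stand-in for london_bikes (assumption: column names match the real data)
rng = np.random.default_rng(42)
dates = pd.date_range("2022-01-01", "2023-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
# Temperature follows a yearly cycle; rentals rise with temperature plus noise
mean_temp = 11 + 8 * np.sin(2 * np.pi * (day_of_year - 105) / 365) + rng.normal(0, 2, len(dates))
bikes_hired = (15000 + 900 * mean_temp + rng.normal(0, 3000, len(dates))).round().astype(int)
demo = pd.DataFrame({"date": dates, "mean_temp": mean_temp, "bikes_hired": bikes_hired})

# Same month-to-season rule as in the cleaning step, keyed by month number
season_order = ["Winter", "Spring", "Summer", "Fall"]
season_of_month = {12: "Winter", 1: "Winter", 2: "Winter",
                   3: "Spring", 4: "Spring", 5: "Spring",
                   6: "Summer", 7: "Summer", 8: "Summer",
                   9: "Fall", 10: "Fall", 11: "Fall"}
demo["season"] = pd.Categorical(demo["date"].dt.month.map(season_of_month),
                                categories=season_order, ordered=True)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1. Distribution of daily rentals
axes[0].hist(demo["bikes_hired"], bins=30)
axes[0].set_title("Distribution of daily bikes hired")
axes[0].set_xlabel("Bikes hired per day")

# 2. Rentals by season (one box per season, in category order)
season_groups = [grp["bikes_hired"].to_numpy()
                 for _, grp in demo.groupby("season", observed=True)]
axes[1].boxplot(season_groups)
axes[1].set_xticklabels(season_order)
axes[1].set_title("Bikes hired by season")

# 3. Temperature against rentals
axes[2].scatter(demo["mean_temp"], demo["bikes_hired"], s=8, alpha=0.5)
axes[2].set_title("Mean temperature vs bikes hired")
axes[2].set_xlabel("Mean temperature")
axes[2].set_ylabel("Bikes hired per day")

fig.tight_layout()
```

To use it on the real data, replace `demo` with `london_bikes` and drop the synthetic setup; a box plot grouped by `month` follows the same pattern as the seasonal one.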



In [None]:
## Your code goes here

**3. Construct 95% confidence intervals for the mean number of bikes rented per season.**

Repeat the calculation per month.

Interpret the result:

What range of values do you expect the true mean to lie in?

Which seasons/months have higher or lower average demand?

Are there overlaps in the intervals, and what does that mean?

**Deliverables:**

A table or plot showing the mean and confidence intervals.

A short interpretation.
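A common way to build these intervals is the t-based formula, mean ± t* × s / √n, applied within each group. The sketch below shows that calculation on synthetic data; the dataframe `demo` and its values are invented for illustration, and only the column names `season` and `bikes_hired` are assumed to match the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data (assumption: invented values, real column names)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "season": np.repeat(["Winter", "Spring", "Summer", "Fall"], 90),
    "bikes_hired": np.concatenate([
        rng.normal(18000, 4000, 90),   # Winter
        rng.normal(26000, 5000, 90),   # Spring
        rng.normal(33000, 6000, 90),   # Summer
        rng.normal(27000, 5000, 90),   # Fall
    ]).round(),
})

confidence = 0.95

# Mean, sample standard deviation (ddof=1) and group size per season
summary = demo.groupby("season")["bikes_hired"].agg(["mean", "std", "count"])

# Standard error of the mean and the two-sided t critical value for each group
summary["sem"] = summary["std"] / np.sqrt(summary["count"])
summary["t_crit"] = stats.t.ppf((1 + confidence) / 2, df=summary["count"] - 1)

# Interval bounds: mean -/+ t_crit * sem
summary["ci_lower"] = summary["mean"] - summary["t_crit"] * summary["sem"]
summary["ci_upper"] = summary["mean"] + summary["t_crit"] * summary["sem"]

print(summary[["mean", "ci_lower", "ci_upper"]].round(0))
```

Grouping by `month` instead of `season` gives the monthly version. When reading the table, note that overlapping intervals mean the data cannot clearly separate the two group means, but overlap alone is not a formal test of whether they differ.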

In [None]:
## Your code goes here

**4. Regression Analysis**

What variables influence the number of bikes rented (y) and how? Build a regression model that best explains the variability in bikes rented.

**Interpret:**

Which predictors are significant?

What do the coefficients mean (in practical terms)?

How much of the variation in bike rentals is explained (R²)?

**Deliverables:**

Regression output table.

A short discussion of which factors matter most for predicting bike rentals.
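As a starting point, a model of this kind can be fitted with the `statsmodels` formula interface imported at the top of the notebook. The sketch below uses synthetic data with a known relationship, so the fitted coefficients can be compared with the values used to generate it. The choice of predictors (`mean_temp`, `precipitation`, `weekend`) and all numbers are assumptions for illustration, not a recommendation of the best model for the real data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with a known structure (assumption: invented values, real column names)
rng = np.random.default_rng(1)
n = 400
demo = pd.DataFrame({
    "mean_temp": rng.normal(12, 6, n),
    "precipitation": rng.exponential(2, n),
    "weekend": rng.random(n) < 2 / 7,
})
demo["bikes_hired"] = (
    15000
    + 800 * demo["mean_temp"]        # more rentals on warmer days
    - 300 * demo["precipitation"]    # fewer rentals on wetter days
    - 2500 * demo["weekend"]         # fewer rentals at weekends
    + rng.normal(0, 2000, n)         # unexplained day-to-day variation
)

# Ordinary least squares; the boolean weekend column is treated as a categorical
# predictor and appears in the output as weekend[T.True]
model = smf.ols("bikes_hired ~ mean_temp + precipitation + weekend", data=demo).fit()

print(model.summary())                          # full regression output table
print("R-squared:", round(model.rsquared, 3))   # share of variation explained
print(model.pvalues.round(4))                   # p-value per coefficient
```

Each coefficient is read as the expected change in daily rentals for a one-unit increase in that predictor, holding the others fixed. On the real data, rows with missing predictor values are dropped by the formula interface by default, which is worth keeping in mind given the missing values found during cleaning.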

In [None]:
## Your code goes here

## Deliverables
A knitted HTML file. One person per group should submit it.